GraceBako commented 3 years ago

Dear Prof Nagarajan,

I am Grace Nabakooza, a Ph.D. student working on influenza in Africa. I am currently writing a manuscript on viral reassortment and using your super tool (GiRaF) to detect reassortants. GiRaF runs perfectly for datasets of fewer than 600 genomes. However, in one of my runs I have 1,200 genomes, for which I am trying to infer reassortants based on all 28 gene pairs. I plan to repeat this run 20 times to ensure the reassortants have high support based on both the GiRaF confidence parameter and their frequency (appearing in more than half of the 20 repeat runs).

Unfortunately, when I launch all 20 repeat runs on the server, they are very slow. I am using a Linux server where I am entitled to only two nodes with 30 CPUs each. Could you please advise me on how to speed up or parallelize the GiRaF algorithm?

I am looking forward to your response. Happy 2021!

Grace
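The "more than half of the 20 repeat runs" criterion described above can be sketched in a few lines of Python. This is only an illustration: parsing GiRaF output is not shown, and the representation of each run as a collection of candidate reassortant sets (frozensets of strain names) is an assumption, as are the strain names in the example.

```python
from collections import Counter


def majority_supported(runs, min_fraction=0.5):
    """Return candidates reported in more than min_fraction of the runs.

    `runs` is a list with one entry per repeat run; each entry is an
    iterable of candidate reassortant sets (hypothetical representation:
    a frozenset of strain names per candidate).
    """
    counts = Counter()
    for run in runs:
        # Count each candidate at most once per run.
        counts.update(set(run))
    threshold = min_fraction * len(runs)
    return {cand: n for cand, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Hypothetical strain names, for illustration only.
    ab = frozenset({"strain_A", "strain_B"})
    c = frozenset({"strain_C"})
    runs = [[ab], [ab, c], [c], [ab]]
    for cand, n in majority_supported(runs).items():
        print(sorted(cand), n)
```

With 20 runs and the default threshold, a candidate would need to appear in at least 11 runs to be kept.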

nnnagara commented 3 years ago

Dear Grace,

Thank you for your query. Do you observe convergence of your MCMC runs? If you do, then I do not think you need 20 separate runs to confirm support for reassortment events.

Carl – do you have any additional thoughts here?

Regards,

Niranjan

(In reply to GraceBakoUg's message of Friday, 8 January 2021 at 3:08 PM. Subject: [CSB5/GiRaF] Help on parallelisation (#1))


GraceBako commented 3 years ago

Thank you for the timely response.

In my previous runs, I did not get good convergence. This could be because I am sampling too few trees (200,000) in each MrBayes run.

nnnagara commented 3 years ago

Hi all,

Is it the MCMC estimation that is taking the time (as Niranjan and I assume) or the reassortment prediction phase?

If it’s the MCMC, you could try an alternative phylogenetic estimator.

If it’s GiRaF itself, I would be surprised, but then the question would be whether we could speed up that part of the code.

Carl

(In reply to Niranjan NAGARAJAN's message of Jan 11, 2021, at 10:18 PM.)


GraceBako commented 3 years ago

Dear all,

Thanks for the feedback and help. Sorry for the late reply; the internet was cut off across the whole country because of the presidential elections.

The MrBayes runs worked fine and I am done with that step. It is the reassortment prediction phase (GiRaF) that is rather slow. I set out to compare all segments (28 pairs), and after almost 10 days it has not completed.

I am running GiRaF on an Ubuntu server using the commands attached.

Grace

(In reply to nnnagara's message of Tue, Jan 12, 2021 at 9:35 PM.)


#!/bin/bash
# run giraf analysis on in.giraf file
#SBATCH --nodes=1
#SBATCH --tasks-per-node=5
#SBATCH --cpus-per-task=2
/mnt/lustre01/projects/infmodel/Giraf_analysis_Aug2020/giraf/bin/giraf_linux64 in.giraf

Africa_AH3N2_HA Ug_and_Africa_Ah3n2_HA_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_HA_giraf.nexus.run2.t
Africa_AH3N2_PB2 Ug_and_Africa_Ah3n2_PB2_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_PB2_giraf.nexus.run2.t
Africa_AH3N2_PB1 Ug_and_Africa_Ah3n2_PB1_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_PB1_giraf.nexus.run2.t
Africa_AH3N2_PA Ug_and_Africa_Ah3n2_PA_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_PA_giraf.nexus.run2.t
Africa_AH3N2_NP Ug_and_Africa_Ah3n2_NP_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_NP_giraf.nexus.run2.t
Africa_AH3N2_NA Ug_and_Africa_Ah3n2_NA_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_NA_giraf.nexus.run2.t
Africa_AH3N2_MP Ug_and_Africa_Ah3n2_MP_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_MP_giraf.nexus.run2.t
Africa_AH3N2_NS Ug_and_Africa_Ah3n2_NS_giraf.nexus.run1.t Ug_and_Africa_Ah3n2_NS_giraf.nexus.run2.t
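One possible way to spread the work over the available CPUs, sketched here in Python, is to split the eight-segment input into one two-line input per segment pair (8 choose 2 = 28) and submit each as its own job. This rests on untested assumptions: that GiRaF accepts an input file listing only two segments, that pairwise output is sufficient for the analysis (any step that combines results across all segments would be lost), and that the `<A>_<B>.giraf` file names, which are hypothetical, suit the workflow. The line format is copied from the input listing above.

```python
from itertools import combinations

# Segment labels taken from the input listing above.
SEGMENTS = ["HA", "PB2", "PB1", "PA", "NP", "NA", "MP", "NS"]


def input_line(segment):
    """Rebuild one input line: label, then the two MrBayes tree files."""
    stem = f"Ug_and_Africa_Ah3n2_{segment}_giraf.nexus"
    return f"Africa_AH3N2_{segment} {stem}.run1.t {stem}.run2.t"


def pair_inputs(segments=SEGMENTS):
    """Map a hypothetical file name to the two-line input for each pair."""
    return {
        f"{a}_{b}.giraf": input_line(a) + "\n" + input_line(b) + "\n"
        for a, b in combinations(segments, 2)
    }


if __name__ == "__main__":
    inputs = pair_inputs()
    print(len(inputs))  # 28 pairs from 8 segments
    print(inputs["HA_PB2.giraf"], end="")
```

Each generated input could then be written to disk and passed to the executable in a separate single-task job. If the program runs as a single process, requesting five tasks with two CPUs each for one invocation, as in the script above, would not by itself make that invocation faster.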

GraceBako commented 3 years ago

Dear Prof Niranjan and Carl,

Sorry for my continuous emails, but I am glad you can help. Apart from the parallelisation, I have two issues on which I would value your expert opinion.

1. Burn-in: I ran MrBayes with a burn-in of 500. From the two MrBayes runs I have a total of 400,000 trees generated per segment, so if GiRaF further ignores the first 1,000, that leaves 390,000 trees to work with. Do you think the excluded trees would affect the results in any way? If yes, I will have to repeat all the analyses with the --burnin=0 option when calling giraf.

2. MrBayes output: I also noticed that in your "testdata" folder, the MrBayes run1.t file contains two nexus files and a total of 400,000 trees (please see attached). This could be because your "run_mrbayes.sh" has two chunks of commands (it runs twice). However, I ran MrBayes using the code below, and my run1.t files each contain one nexus file (with 200,000 trees).

My code:

begin mrbayes;
set autoclose=yes nowarn=yes;
execute Ug_and_Africa_AH1N1pdm09_HA_giraf.nexus;
lset nst=6 rates=invgamma;
mcmc ngen=200000 samplefreq=200 nruns=2;
sump burnin=500;
sumt burnin=500;
end;

Do you think these alterations may affect my results?
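A quick way to sanity-check the tree counts is to work them out from the MCMC settings and compare against the actual .t files. The Python sketch below assumes that MrBayes writes the starting tree plus one tree every `samplefreq` generations to each run file, and that burn-in is counted in sampled trees per run file; whether GiRaF applies its burn-in per file or in total is not stated in this thread and is also an assumption here.

```python
def sampled_trees(ngen, samplefreq):
    """Trees per run file, assuming the starting tree plus one tree every
    `samplefreq` generations is written (assumption about MrBayes)."""
    return ngen // samplefreq + 1


def trees_after_burnin(ngen, samplefreq, nruns, burnin_per_run):
    """Trees left across all runs after discarding `burnin_per_run`
    sampled trees from each run file (assumed per-file burn-in)."""
    per_run = sampled_trees(ngen, samplefreq)
    return nruns * max(per_run - burnin_per_run, 0)


def count_trees(lines):
    """Count tree statements in the lines of a .t file, assuming each
    sampled tree sits on its own line starting with the word 'tree'."""
    return sum(1 for line in lines if line.strip().lower().startswith("tree "))


if __name__ == "__main__":
    # Settings from the mcmc command above: ngen=200000, samplefreq=200.
    print(sampled_trees(200000, 200))
    print(trees_after_burnin(200000, 200, nruns=2, burnin_per_run=500))
    print(trees_after_burnin(200000, 200, nruns=2, burnin_per_run=1000))
```

Under these assumptions, the settings shown would give far fewer sampled trees per run than the 200,000 mentioned, so it may be worth running something like `count_trees` on one of the run1.t files to confirm what was actually written before choosing a burn-in value.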

Looking forward to hearing from you. Grace

(Follow-up to Grace Nabakooza's message of Tue, Jan 19, 2021 at 9:47 AM.)

