AstraZeneca-NGS / VarDict

No parallelization and high time consumption #147

Open rahul-yadav-supra opened 4 years ago

rahul-yadav-supra commented 4 years ago

Dear Team,

I am using VarDict and have found that it is not parallelized: it takes around 8 hours for my samples. Even if I split my sample into smaller pieces, it takes 4 hours to run all the pieces in parallel. Could you please suggest a way to run the variant caller much faster?

Thanking You in anticipation.

With Best Regards, Rahul Yadav

PolinaBevad commented 4 years ago

Hi @rahul-yadav-supra,

Thank you for using VarDict! Yes, you are right: the Perl version of VarDict is not parallelized, but you can parallelize it yourself by running it on separate BED regions with GNU parallel or a similar tool and combining the results afterwards.
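The region-splitting step can be sketched as follows. This is a minimal illustration, not part of VarDict: the function name `split_bed` is hypothetical, and it assumes a plain BED file with one region per line. It distributes regions round-robin so that each chunk can be written to its own file and passed to a separate VarDict run.

```python
from typing import List


def split_bed(lines: List[str], n_chunks: int) -> List[List[str]]:
    """Distribute BED records round-robin over at most n_chunks lists.

    Hypothetical helper for illustration; header and blank lines are skipped.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Keep only data records; BED headers start with '#', 'track' or 'browser'.
    records = [
        ln for ln in lines
        if ln.strip() and not ln.startswith(("#", "track", "browser"))
    ]
    # Never create more chunks than there are records (but always at least one).
    chunks: List[List[str]] = [[] for _ in range(min(n_chunks, len(records)) or 1)]
    for i, rec in enumerate(records):
        chunks[i % len(chunks)].append(rec)
    return chunks


if __name__ == "__main__":
    example = [
        "19\t100\t200\n",
        "19\t300\t400\n",
        "19\t500\t600\n",
    ]
    for idx, chunk in enumerate(split_bed(example, 2)):
        print(f"chunk {idx}: {len(chunk)} region(s)")
```

Round-robin assignment balances the number of regions per chunk, not their total length or coverage, so run times may still be uneven if a few regions are much larger than the rest.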

That said, I recommend using VarDictJava (repository: https://github.com/AstraZeneca-NGS/VarDictJava). VarDictJava is a port of this VarDict repository, but it supports multithreading, runs much faster, has tests, and is currently much better supported than the Perl version.

To use parallelization in VarDictJava, specify the number of cores with the -th option.

If you have high coverage or your samples are WGS, you can also use the --fisher option in VarDictJava, which replaces the R scripts run after the main VarDict process with internal Java calculation of p-values. Our tests show that on small samples the performance is mostly the same, but on WGS samples the R script can be very slow, so this can reduce processing time as well. Some information on parallel runs can be found here: https://github.com/AstraZeneca-NGS/VarDictJava/wiki/Best-practics-for-WGS-and-multithreading

Hope this helps!

rahul-yadav-supra commented 4 years ago

Dear @PolinaBevad ,

I am very thankful for the help and information you have shared. I found the Java version to be very fast and very helpful, but I have a question. The Perl version of the caller gives me more calls than the Java version: VarDictJava reports 3900 fewer variants than VarDict Perl. I am not able to understand the reason. Could you please help me understand it?

Thanking You in advance.

With best regards, Rahul

PolinaBevad commented 4 years ago

Hi Rahul,

That should not happen in a typical situation; we try to keep the results of both versions identical. Which versions of Perl and Java VarDict did you use? If both are >=1.6.0, the results should be the same (previously the Perl version showed non-deterministic behavior on some variants in realignment). Also, did you use the same options when running both versions?

If that doesn't help, could you please provide a small BAM slice where you get different results from the two versions? Thank you!

rahul-yadav-supra commented 4 years ago

Dear @PolinaBevad

Thank you for all the help and information you gave me. As you recommended, I checked the parameters and every other aspect of the command and the files that could differ. For convenience, I took one file and restricted it to a single chromosome (chromosome 19). I then ran the analysis with both the VarDict Perl version and the VarDictJava version, keeping all parameters the same. I still get a difference of 17 variant calls (more in the Perl version than in the Java version). There are also some variants that are present in the Java VCF file but not in the Perl one. In total, there are 27 differences across the Perl and Java VCFs.
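One way to pin down exactly which calls differ is to compare the two VCFs by variant key. The sketch below is an illustration only: the function names are hypothetical, and it assumes standard tab-separated VCF records, keyed on CHROM, POS, REF and each ALT allele.

```python
from typing import Iterable, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect one key per ALT allele from the data lines of a VCF."""
    keys: Set[VariantKey] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_calls(
    perl_lines: Iterable[str], java_lines: Iterable[str]
) -> Tuple[List[VariantKey], List[VariantKey]]:
    """Return (keys only in the Perl VCF, keys only in the Java VCF)."""
    perl, java = variant_keys(perl_lines), variant_keys(java_lines)
    return sorted(perl - java), sorted(java - perl)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    perl_vcf = ["#CHROM\tPOS\tID\tREF\tALT\n", "19\t100\t.\tA\tG\n", "19\t200\t.\tC\tT\n"]
    java_vcf = ["#CHROM\tPOS\tID\tREF\tALT\n", "19\t100\t.\tA\tG\n", "19\t300\t.\tG\tA\n"]
    only_perl, only_java = compare_calls(perl_vcf, java_vcf)
    print("only in Perl:", only_perl)
    print("only in Java:", only_java)
```

The two lists separate the difference in counts from the number of differing records. If the 27 differences reported above are the total of both lists and 17 is the surplus on the Perl side, the lists would hold 22 Perl-only and 5 Java-only calls.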

rahul-yadav-supra commented 4 years ago

Dear @PolinaBevad Please let me know once you have downloaded the bam file so that I can delete the link and remove the bam file.

Thanks and Best Regards, Rahul Yadav

PolinaBevad commented 4 years ago

Thank you, Rahul, I've downloaded the BAM file. Can you please share the command line you ran and the BED file (or a region from it)?

Thanks!

rahul-yadav-supra commented 4 years ago

Dear @PolinaBevad,

Please let me know once you have finished downloading the BED file. Also let me know if you find any differences in the results and how to correct them.

Thanking You in anticipation.

With best regards, Rahul

PolinaBevad commented 4 years ago

Rahul, thank you, I've downloaded the target file as well, so you can delete it. I will let you know about the results; I am not sure about today, but I will try to check this on Monday. Thank you!

rahul-yadav-supra commented 4 years ago

Sure @PolinaBevad.

Thank You for your time and help to look into the issue. Thank You so much.

rahul-yadav-supra commented 4 years ago

Good morning @PolinaBevad, I hope you are in good health and doing well. I just wanted to ask whether you were able to look into issue #147 and find out what is happening between the two callers (Perl and Java versions). Thank you for your help.

With Best Regards, Rahul Yadav

PolinaBevad commented 4 years ago

Hi Rahul,

Thank you very much! I checked the master versions of VarDict Java and Perl with the parameters -c 1 -S 2 -E 3 -g 4 -f 0.001 on your data (BAM file and BED file), and I got an identical number of variants for both runs: 25524 raw variants in total (that is, before running the teststrandbias.R and var2vcf_valid.pl scripts). The Java version was run in multithreaded mode with -th.

I think you are using different versions of VarDict Java and Perl, and that may be the reason. Could you please fetch the master branch of both (by cloning the repositories) and rerun your script? The master version of Java can be compiled after cloning as described here: https://github.com/AstraZeneca-NGS/VarDictJava#getting-started and the master version of Perl can simply be cloned, as it is ready to use from the repository.

I currently know of only one case in which the results of the two versions can differ: the option -k 0, i.e. when local realignment is disabled. In that case the Perl version still applies its CIGAR modification algorithms, whereas in Java disabling local realignment also disables CIGAR modification. In all other cases the behavior should be the same.

Could you please clone the latest versions of VarDict and let me know whether you still get different results?

Thank you!

rahul-yadav-supra commented 4 years ago

Thank You so much Polina. I will do as you suggested and will let you know of the result.

Thank you very much for your anticipated cooperation.

With best regards, Rahul Yadav

rahul-yadav-supra commented 4 years ago

Dear @PolinaBevad ,

Could you kindly tell me the exact commands that you used for both the Perl and Java versions of the variant caller?

Thanking you in anticipation.

With best regards, Rahul Yadav

PolinaBevad commented 4 years ago

Hi Rahul,

I've run it with:

VarDict -G human_g1k_v37_decoy.fasta -c 1 -S 2 -E 3 -g 4 -f 0.001 -b final.bam target.txt | teststrandbias.R | var2vcf_valid.pl > vardict.result.vcf

For the Java version, I added the -th 3 option.
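To make sure both runs really use the same options, the two pipelines can be generated from one template. The sketch below is only an illustration: `vardict_pipeline` is a hypothetical helper, the executable names passed to it are placeholders for wherever the Perl and Java launchers live on your system, and placing -th before the target file is an assumption. All other flags and file names are copied from the command above.

```python
import shlex
from typing import Optional


def vardict_pipeline(
    vardict_exe: str,
    reference: str,
    bam: str,
    bed: str,
    threads: Optional[int] = None,
) -> str:
    """Build the shell pipeline from the command above as a single string.

    Pass threads only for the Java version, which accepts -th.
    """
    args = [
        vardict_exe,
        "-G", reference,
        "-c", "1", "-S", "2", "-E", "3", "-g", "4",
        "-f", "0.001",
        "-b", bam,
    ]
    if threads is not None:
        args += ["-th", str(threads)]  # assumed position: before the target file
    args.append(bed)
    caller = " ".join(shlex.quote(a) for a in args)
    return " | ".join([caller, "teststrandbias.R", "var2vcf_valid.pl"])


if __name__ == "__main__":
    common = ("human_g1k_v37_decoy.fasta", "final.bam", "target.txt")
    # "VarDict" stands in for the actual launcher path of each version.
    print(vardict_pipeline("VarDict", *common))
    print(vardict_pipeline("VarDict", *common, threads=3))
```

Redirecting each printed pipeline to its own output file then gives two VCFs produced with identical options apart from -th.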

rahul-yadav-supra commented 4 years ago

Thank You Polina. Thank you for sharing the commands and for all the help and support you provided. I will let you know about the results.

With best regards, Rahul Yadav

rahul-yadav-supra commented 4 years ago

Dear @PolinaBevad,

Thank you for your help. I am using the 2019.06.04 version of the Perl VarDict, which I think is version 1.6 itself (please correct me if I am wrong). I have also downloaded version 1.6 of the Java version. I installed the Perl version using conda.

I also cloned the Perl version as you asked, and I am running it from the folder where it was cloned. The problem is that I am still getting some mismatched variants. It would be a great help if you could provide the exact links for cloning the Java and Perl versions of VarDict. I will create a virtual system and recheck whether the number of variants is the same.

Also, I am now getting about 3 variants that differ between the Java and Perl versions when using the same commands for both: the Java output VCF has 2 variants that are not present in the Perl output VCF, and the Perl output VCF has 1 variant that is not present in the Java output VCF. The variant counts therefore differ by just 1, but 3 variants differ in total. Please suggest how I can verify and resolve this issue.

Thanking you in anticipation.

With best regards, Rahul Yadav

PolinaBevad commented 4 years ago

Hi Rahul, sorry for the late answer!

Yes, you are right: the 2019.06.04 version of Perl VarDict is VarDict 1.6.0. The latest versions in the master branches of the two repositories correspond to each other, so you can simply clone both the VarDict Java and Perl repositories and use them. Java: https://github.com/AstraZeneca-NGS/VarDictJava.git Perl: https://github.com/AstraZeneca-NGS/VarDict.git

Do you get the difference on the test set that you sent me previously? I ran that one on the latest master branches and did not get any differences. Can you tell me which version of Perl (i.e. the language version) you use in your environment? Please run perl -v to check. I remember a few problems with old Perl language versions, as Perl changed the way it works with dictionaries (ordering) starting from 5.8.0. And what exactly is the difference in these 3 variants? Could you depersonalize these variants and post the result here?

Thank you!
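If the Perl language version needs to be recorded alongside each run, the output of perl -v can be parsed with a small helper. This is a hedged sketch: `parse_perl_version` is a hypothetical name, and it assumes the output contains a version string of the form v5.x.y, as in the sample line below.

```python
import re
from typing import Optional, Tuple


def parse_perl_version(perl_v_output: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the text printed by `perl -v`.

    Returns None if no vX.Y.Z version string is found.
    """
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", perl_v_output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


if __name__ == "__main__":
    # Sample line for illustration; real output has more text around it.
    sample = "This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux"
    print(parse_perl_version(sample))
```

On a real system the input text could come from `subprocess.run(["perl", "-v"], capture_output=True, text=True).stdout`, and the resulting tuple can be compared directly, for example against `(5, 8, 0)`.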