ExaScience / elprep

elPrep: a high-performance tool for analyzing sequence alignment/map files in sequencing pipelines.
Other
287 stars 40 forks source link

coverage vs vcf.gz size discrepancy #60

Closed desmodus1984 closed 2 years ago

desmodus1984 commented 2 years ago

Hi, I am using elprep to do variant calling of samples with ~10X coverage and the sample using for genome assembly. The vcf of the ~10X samples were on average 5GB and that genome-assembly sample vcf was 2.9GB. Any ideas of why the samples with ~25 GB reads generated bigger vcf files than the one generated with the 145GB sample?

The code was the same for every sample. I mapped the reads with bwamem-2 followed by vcf calling with elprep.

elprep sfm PA113.sorted.bam PA113.output.bam --nr-of-threads 28 --tmp-path $TMPDIR \ --mark-duplicates --mark-optical-duplicates PA113.metrics \ --sorting-order coordinate \ --bqsr PA113.recal \ --reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta \ --haplotypecaller PA113.vcf.gz

Does it matter whether the input was bam or sam? or coordinate-sorted or not?

Thanks;

caherzee commented 2 years ago

Hi,

It seems possible to me that a larger input bam produces a smaller output vcf compared to a smaller input bam.

Do you suspect there is a bug in elPrep because this outcome makes no sense from a biological perspective in your case? If so, you could remove the haplotypecaller step from the elprep invocation and try to perform that part with GATK and see what the outcome is? If you find there is a bug, please let us know.

If you pass the --sorting-order coordinate option to elPrep, the tool will sort the input by coordinate order. It is not necessary to pre-sort the input before passing it to elPrep.

elPrep accepts both bam and sam as inputs. It should normally not matter for the final outcome.

Thanks!

Kind regards, Charlotte

desmodus1984 commented 2 years ago

Hi Charlotte,

I wanted to ask you a question. For the ~5 GB vcf files, I mapped fastq files, while for the ~2 GB vcf from the reference, I mapped reads but in fasta format. Do you think this can explain the size difference, and is okay to use as input a bam from fasta reads or is it better to use a bam from mapped fastq files?

Thanks

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Charlotte Herzeel @.> Sent: Monday, March 7, 2022, 7:59 AM To: ExaScience/elprep @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: Re: [ExaScience/elprep] coverage vs vcf.gz size discrepancy (Issue #60)


NOTICE: This message was sent from outside Ohio University. Please use caution when clicking links or opening attachments in this message.


Hi,

It seems possible to me that a larger input bam produces a smaller output vcf compared to a smaller input bam.

Do you suspect there is a bug in elPrep because this outcome makes no sense from a biological perspective in your case? If so, you could remove the haplotypecaller step from the elprep invocation and try to perform that part with GATK and see what the outcome is? If you find there is a bug, please let us know.

If you pass the --sorting-order coordinate option to elPrep, the tool will sort the input by coordinate order. It is not necessary to pre-sort the input before passing it to elPrep.

elPrep accepts both bam and sam as inputs. It should normally not matter for the final outcome.

Thanks!

Kind regards, Charlotte

— Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FExaScience%2Felprep%2Fissues%2F60%23issuecomment-1060660305&data=04%7C01%7Cja569116%40ohio.edu%7Cec5ba1ed41eb4f0b9d7108da003a5344%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637822547800089736%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=bq3m3X5FU7aVvrkFEK%2BSOuqWmNnQ3qiTcq%2FfvWdpznk%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJWD2VIH3NHPLALV4V35IMLU6X4TJANCNFSM5PYEIHUQ&data=04%7C01%7Cja569116%40ohio.edu%7Cec5ba1ed41eb4f0b9d7108da003a5344%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637822547800089736%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=9Gqixj27DetnFqAHujBiAlgQaYgOyaxci%2FKW5iN0FUo%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.Message ID: @.***>

caherzee commented 2 years ago

Hi,

The FASTA and FASTQ formats store different information about sequences. As far as I am aware, FASTA is mostly used for storing reference sequences, whereas the FASTQ format is used for storing reads.

The FASTQ format stores information such as base quality scores, sequence identifier, etc, which are all used by the mapping algorithm, as well as other downstream analysis algorithms. This information is therefore transferred to the bam files. For example, the quality scores, for which there is one for each base, are also stored in the bam. If you map read stored in FASTA -which I did not know is possible- then you won't have this information and cannot store it in the output bam. Hence that would easily half your output size.

I am not sure it is meaningful to run the mark duplicates/bqsr/haplotype caller pipeline on bams that lack information such as quality scores as the underlying algorithms rely on this information...

Best, Charlotte

desmodus1984 commented 2 years ago

Hi,

I was having the same question but as you saw I did the mapping and then the variant calling with the fasta-bam file. I was wondering if that lack of information did get translate and hence the final vcf was much smaller than the other.

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Charlotte Herzeel @.> Sent: Thursday, March 17, 2022, 5:52 AM To: ExaScience/elprep @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: Re: [ExaScience/elprep] coverage vs vcf.gz size discrepancy (Issue #60)


NOTICE: This message was sent from outside Ohio University. Please use caution when clicking links or opening attachments in this message.


Hi,

The FASTA and FASTQ formats store different information about sequences. As far as I am aware, FASTA is mostly used for storing reference sequences, whereas the FASTQ format is used for storing reads.

The FASTQ format stores information such as base quality scores, sequence identifier, etc, which are all used by the mapping algorithm, as well as other downstream analysis algorithms. This information is therefore transferred to the bam files. For example, the quality scores, for which there is one for each base, are also stored in the bam. If you map read stored in FASTA -which I did not know is possible- then you won't have this information and cannot store it in the output bam. Hence that would easily half your output size.

I am not sure it is meaningful to run the mark duplicates/bqsr/haplotype caller pipeline on bams that lack information such as quality scores as the underlying algorithms rely on this information...

Best, Charlotte

— Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FExaScience%2Felprep%2Fissues%2F60%23issuecomment-1070686717&data=04%7C01%7Cja569116%40ohio.edu%7C89bd46f6c40b4071e3b908da07fbc7bd%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637831075227024716%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=nCGRxlo5XKEyA%2ByliZddkoXgIo2E0ko0b4Esz5htCQ0%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJWD2VPJ4APSOJBQHGNF6DLVAL6D5ANCNFSM5PYEIHUQ&data=04%7C01%7Cja569116%40ohio.edu%7C89bd46f6c40b4071e3b908da07fbc7bd%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637831075227180967%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=FlNy7Ik2BW33HfmxEJh0lZ27eMtebv1bmqtKQMBK%2Bnw%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.Message ID: @.***>