BIMSBbioinfo / pigx_sars-cov-2

PiGx SARS-CoV-2 wastewater sequencing pipeline
GNU General Public License v3.0

Outputs from only *.vcf file #95

Open sinclairify opened 2 years ago

sinclairify commented 2 years ago

I'd like to see how to generate mutation and lineage reports using only a .vcf file as input. The commercial lab that provides our county's wastewater sequencing services supplies a .vcf but not the raw fastq file, in an effort to protect the company's proprietary primers. They don't provide wastewater sequencing reports, and they process the extracted RNA (from wastewater) as a clinical sample. The result is a few different files, and I'd like to use PiGx to generate some lineage and mutation charts.

The *.vcf is generated from our commercial lab after they:

  1. Align NGS reads to human genome and the seven coronaviruses that are known to affect humans
  2. Trim the Fulgent primers from the ends of the reads that uniquely align to SARS-CoV-2 using the iVar trim utility
  3. Compute coverage pileup using Samtools mpileup utility
  4. Generate VCF using VarScan v2.4.3

We have a few outputs from them: <pangolin_##_trimmed.csv>, <##_ivar_consensus_trimmed_qual.fa>, <##_ivar_consensus_trimmedqual.txt>, and <VarScan##_trimmed.vcf>.

I'm providing some files that they returned to us in late November. I'm assuming the *.vcf is the best bet. Any help would be appreciated.

SH7951.zip

vicfabienne commented 2 years ago

Hey, thank you for the request! I looked into it, and from what I can see it should be doable. However, I'm not yet completely sure how to deal with the missing quality control. Any analysis and calculation would be performed under the strong assumption that all samples are of comparable quality, i.e. comparable sequencing depth across the whole genome at the mutation sites, reference genome coverage, etc. Since there is no way to check this automatically from the vcf files alone, the reports could only be relied on to a limited extent when taken on their own. You would need to do that QC part separately.

If you still think this could help you, I would go forward with it on a separate branch. I can't promise anything, but if it works as expected I'd try to get a version working there.

sinclairify commented 2 years ago

Hi. Thanks for offering that. It's a great way to go because we do have some broad QC numbers in some of the outputs that are provided. I will manually check a few items:

I suggest proceeding and I'll ask that company about more detail. Thanks!

jonasfreimuth commented 2 years ago

Hello,

here is a little update: I am currently working on enabling direct vcf input. However, there are some INFO fields that need to be present, namely Allele Frequency (AF) and Depth (DP). The information from both those fields is required by the downstream analysis. I tried running the pipeline on the vcf files you provided, but they are lacking that info. Also, when I try to work around this, no nucleotide info gets found by vep, which I am still investigating.
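To see up front whether a given file carries those fields, a check along the following lines could help. This is only a minimal sketch using the Python standard library, not code from the pipeline: the function name `missing_info_fields` and the example record are made up for illustration, and the only thing taken from this thread is that AF and DP are expected in the INFO column.

```python
import re

# INFO keys named in this thread as required by the downstream analysis.
REQUIRED_INFO = ("AF", "DP")


def missing_info_fields(vcf_lines, required=REQUIRED_INFO):
    """Return the required INFO keys that are undeclared in the header
    or absent from at least one variant record (hypothetical helper)."""
    declared = set()
    missing = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO="):
            match = re.match(r"##INFO=<ID=([^,>]+)", line)
            if match:
                declared.add(match.group(1))
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) < 8:
                continue  # skip malformed records without an INFO column
            keys = {entry.split("=", 1)[0] for entry in cols[7].split(";")}
            missing.update(key for key in required if key not in keys)
    missing.update(key for key in required if key not in declared)
    return sorted(missing)


# Made-up example: a header and one record that only carry an ADP entry.
example = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t241\t.\tC\tT\t.\tPASS\tADP=100",
]
print(missing_info_fields(example))  # prints ['AF', 'DP']
```

For a real file, the lines could be passed in with `open(path)`; gzipped VCFs would need `gzip.open(path, "rt")` instead.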

So if you (still) want to use the pigx-sars-cov-2 pipeline to analyse your data, you would probably need to get your variants called with lofreq. There is a version that should be capable of producing variant reports from lofreq vcf output alone on branch predefine_file_io in my personal repo (not thoroughly tested at all).
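For VarScan files, one conceivable stopgap is to copy the per-sample values into the INFO column. The sketch below assumes the VCF has a FORMAT column with keys named `FREQ` (a percentage string such as `12.5%`) and `DP`, which is how VarScan output is commonly laid out; whether the files in this thread look like that is not confirmed here. The function name `add_af_dp_to_info` is made up, this is not part of the pipeline, and it does not address the VEP problem mentioned above.

```python
def add_af_dp_to_info(record_line):
    """Append AF and DP to the INFO column of one VCF record, taking them
    from the FREQ and DP entries of the first sample (hypothetical helper)."""
    cols = record_line.rstrip("\n").split("\t")
    if len(cols) < 10:
        raise ValueError("record has no FORMAT/sample columns")
    # Pair each FORMAT key with the matching value of the first sample.
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
    if "FREQ" not in sample or "DP" not in sample:
        raise ValueError("FORMAT column lacks FREQ or DP")
    # Assumption: FREQ is a percentage such as "12.5%"; convert to a fraction.
    af = float(sample["FREQ"].rstrip("%")) / 100.0
    depth = int(sample["DP"])
    extra = f"AF={af:.6g};DP={depth}"
    cols[7] = extra if cols[7] in (".", "") else f"{cols[7]};{extra}"
    return "\t".join(cols)


# Made-up record for illustration.
record = "NC_045512.2\t241\t.\tC\tT\t.\tPASS\tADP=100\tGT:DP:FREQ\t1/1:98:12.5%"
print(add_af_dp_to_info(record).split("\t")[7])  # prints ADP=100;AF=0.125;DP=98
```

A rewritten file would also need matching `##INFO` header lines for AF and DP, and the depth definition used by VarScan may differ from what the downstream analysis expects, so results would need checking against samples where fastq data is available.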

sinclairify commented 2 years ago

Hi, thanks Jonas. We were eventually able to obtain some raw fastq, but not for the majority of our weekly assessments. I'm going to try the predefine_file_io (https://github.com/jonasfreimuth/pigx_sars-cov-2/tree/predefine-rule-io) option that you described.

In reply to: Jonas Freimuth, Sunday, July 10, 2022 2:31 PM, Re: [BIMSBbioinfo/pigx_sars-cov-2] Outputs from only *.vcf file (Issue #95)

jonasfreimuth commented 2 years ago

FYI, development of that branch will now take place on predef-rule-io-dev, due to git reasons

jonasfreimuth commented 1 year ago

The changes are now merged into main in #142. But I have no updates on getting nucleotide info from the files @sinclairify provided.

sinclairify commented 1 year ago

Thank you @jonasfreimuth. Our governmental partners are working with the sequencing company (Fulgent) to provide this. They had some staff changes and lost track of our progress. We will keep trying.