baymlab / wastewater_analysis

Variant abundance estimation for SARS-CoV-2 in wastewater using RNA-Seq quantification
MIT License

0 sequences from selection found #8

Open chirrie opened 2 years ago

chirrie commented 2 years ago

Why am I getting the output below when I run the preprocessing script?

1323 sequences selected
searching fasta and writing sequences to output directory...
3679 sequences from input fasta processed
0 sequences from selection found

jbaaijens commented 2 years ago

This suggests that your metadata doesn't match the sequences provided. Could also be that the GISAID formatting has changed again. Could you check whether the sequence identifiers in your fasta file are of the form <Virus name>|<Collection date>|<Submission date> as given in the metadata?
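A quick way to run this check is to build the expected identifiers from the metadata and count how many FASTA headers match. This is a minimal sketch and not the preprocessing script itself: the column names are assumed to be exactly `Virus name`, `Collection date` and `Submission date`, and the helper names are hypothetical.

```python
import csv


def fasta_ids(fasta_lines):
    """Yield the identifier of every FASTA record (header line without '>')."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield line[1:].strip()


def expected_ids(metadata_lines,
                 fields=("Virus name", "Collection date", "Submission date")):
    """Build '<Virus name>|<Collection date>|<Submission date>' per metadata row.

    The column names are an assumption; adjust them to your metadata header.
    """
    reader = csv.DictReader(metadata_lines, delimiter="\t")
    return {"|".join(row[f] for f in fields) for row in reader}


def count_matches(fasta_lines, metadata_lines):
    """Count FASTA identifiers that also occur in the metadata-derived set."""
    wanted = expected_ids(metadata_lines)
    return sum(1 for seq_id in fasta_ids(fasta_lines) if seq_id in wanted)
```

Both arguments can be open file handles. A result of 0 on real data would mean the FASTA headers and the metadata are formatted differently.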

chirrie commented 2 years ago

Sure. There are quite a few changes that affect the pipeline. Could you please have a look? I am using the latest script in the pipeline folder.

jbaaijens commented 2 years ago

Ok I see the problem, the sequence identifiers are again formatted differently. It's an easy fix, I'll try to do it tomorrow.

chirrie commented 2 years ago

Did you get a moment to fix it?
Many thanks

jbaaijens commented 2 years ago

Just made some changes, could you pull the latest commit and give it a try?

jbaaijens commented 2 years ago

I just downloaded the most recent GISAID data and the formatting hasn't changed. It seems the data you have shown above is actually older; it corresponds to the format I encountered ~9 months ago. So you could try the files in the manuscript folder for processing your data. However, I strongly recommend that you download the latest version of the full GISAID database and work with the scripts in pipeline.

chirrie commented 2 years ago

The N-Content line also affects the select_sample script.

jbaaijens commented 2 years ago

Yes, it uses this information. In the version from 9 months ago (see manuscript) we calculated N-content ourselves, but in the meantime it has become part of the GISAID metadata.
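For illustration, calculating N-content by hand could look like the sketch below. It assumes N-content means the fraction of `N` characters in a sequence; it is not the code from the manuscript folder, and the function name is hypothetical.

```python
def n_content(sequence):
    """Return the fraction of ambiguous 'N' bases in a nucleotide sequence.

    Assumption: N-content is defined as count('N') / sequence length.
    An empty sequence is reported as 0.0 to avoid dividing by zero.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    return seq.count("N") / len(seq)
```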

chirrie commented 2 years ago

Could you please download hcov_africa.fasta and hcov_africa.tsv and try running the scripts on them without changing anything? That is what I am using and getting errors with. I downloaded them from the region-specific Auspice source files.

There is data in the manuscript folder. What I have is from GISAID, and I think it is the latest since I downloaded it last week. I had narrowed it down to specific regions since I wanted just a small dataset to test the pipeline with my data first.

jbaaijens commented 2 years ago

Ah so what you're using is not actually the full GISAID data, also not for Africa. These are Auspice files which are used for visualisation with Nextstrain (https://docs.nextstrain.org/projects/auspice/en/stable/) and they only have very few sequences. For building a good reference set you need the full GISAID database [GISAID -> EpiCoV -> Downloads -> Download packages -> FASTA (for the sequences) and metadata (for the metadata)]

jbaaijens commented 2 years ago

If it helps, I can build an Africa-specific reference set and share the sequences / sequence identifiers.

chirrie commented 2 years ago

Weirdly enough, I can't see the Download packages tab. Under GISAID -> EpiCoV -> Downloads I cannot see the option Download packages with FASTA (for the sequences) and metadata (for the metadata). After Downloads, it shows a tab for Alignment and proteins, one for Submission and variant stats, and finally a Genomic epidemiology tab. I have sent a screenshot to your email.

On Tue, Nov 30, 2021 at 12:13 PM jbaaijens @.***> wrote:

> Ah so what you're using is not actually the full GISAID data, also not for Africa. These are Auspice files which are used for visualisation with Nextstrain (https://docs.nextstrain.org/projects/auspice/en/stable/) and they only have very few sequences. For building a good reference set you need the full GISAID database [GISAID -> EpiCoV -> Downloads -> Download packages -> FASTA (for the sequences) and metadata (for the metadata)]

chirrie commented 2 years ago

> If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.

I will appreciate it if I can get this.

chirrie commented 2 years ago

> I just downloaded the most recent GISAID data and the formatting hasn't changed. It seems the data you have shown above is actually older, it corresponds to the format I encountered ~9 months ago. So you could try the files in the manuscript folder for processing your data. However, I strongly recommend you to download the latest version of the full GISAID database and work with the scripts in pipeline.

I have tried downloading a small set of sequences, around 100 per lineage, but I am still getting "0 sequences from selection found". Could you please help with this?

jbaaijens commented 2 years ago

Could you again post what your sequence identifiers look like in the fasta file?

jbaaijens commented 2 years ago

> If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.

> I will appreciate it if I can get this.

I will build it. Can you send me an email at j.a.baaijens[at]tudelft.nl?

chirrie commented 2 years ago

> If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.

> I will appreciate it if I can get this.

> I will build it. Can you send me an email at j.a.baaijens[at]tudelft.nl?

I have sent you an email. Please have a look at it.

chirrie commented 2 years ago

> Could you again post what your sequence identifiers look like in the fasta file?

hCoV-19/Reunion/HCL021109894801/2021|EPI_ISL_2676670|2021-06-08
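Note that the second field of this identifier is an accession ID (EPI_ISL_...), so it has the shape <Virus name>|<Accession ID>|<date> and not <Virus name>|<Collection date>|<Submission date>. A rough sketch for classifying the fields of a header is shown below; the regular expressions and the function name are assumptions for illustration and not part of the pipeline.

```python
import re

# Assumed patterns: GISAID accession IDs look like EPI_ISL_<digits>,
# dates look like YYYY, YYYY-MM or YYYY-MM-DD.
ACCESSION = re.compile(r"^EPI_ISL_\d+$")
DATE = re.compile(r"^\d{4}(-\d{2}){0,2}$")


def describe_header(header):
    """Split a FASTA header on '|' and label every field after the virus name."""
    fields = header.lstrip(">").strip().split("|")
    kinds = []
    for field in fields[1:]:
        if ACCESSION.match(field):
            kinds.append("accession")
        elif DATE.match(field):
            kinds.append("date")
        else:
            kinds.append("other")
    return fields[0], kinds
```

A header in the form expected from the metadata would be labelled with two dates, while the header above is labelled with an accession followed by a date.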

Dipti-IISERpune commented 2 years ago

I am facing the exact same issue. I couldn't get the desired details from GISAID, so I prepared the .tsv myself from the sequence details and accession IDs, together with a FASTA file for those sequences. It gave me the above-mentioned error. Now I am trying to use the fasta and tsv for region Asia and country India from the GISAID download section. Will it help me to run the code if I rearrange the .tsv as given in the example?

jbaaijens commented 1 year ago

Unfortunately the GISAID metadata headers have changed over time, so yes, it should be resolved by renaming columns in the metadata. You could also try VLQ-nf, a nextflow implementation of our pipeline: https://github.com/rki-mf1/vlq-nf
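Renaming the metadata columns could be done along these lines. This is only a sketch: the old column names in the usage example (`strain`, `date`) are hypothetical, so check the header of your own .tsv and the column names the scripts expect before using it.

```python
import csv
import io


def rename_columns(tsv_text, renames):
    """Return the TSV text with header names replaced according to `renames`.

    Columns that are not listed in `renames` keep their original name,
    and all data rows are written back unchanged.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    rows[0] = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Usage with hypothetical old column names:
example = "strain\tdate\nhCoV-19/A/1/2021\t2021-01-01\n"
renamed = rename_columns(
    example, {"strain": "Virus name", "date": "Collection date"}
)
```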