Open chirrie opened 2 years ago
This suggests that your metadata doesn't match the sequences provided. It could also be that the GISAID formatting has changed again. Could you check whether the sequence identifiers in your fasta file are of the form <Virus name>|<Collection date>|<Submission date>, as given in the metadata?
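One way to run the check described above is sketched here. It is not part of the pipeline: the function names are made up for illustration, and the metadata column names ("Virus name", "Collection date", "Submission date") are an assumption taken from the identifier form quoted above, so adjust them to whatever your .tsv actually contains.

```python
import csv

def fasta_ids(handle):
    """Yield the header (without '>') of every record in a FASTA stream."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].strip()

def metadata_ids(handle, columns=("Virus name", "Collection date", "Submission date")):
    """Build 'a|b|c' identifiers from the given metadata columns (assumed names)."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield "|".join(row[c] for c in columns)

def compare(fasta_handle, metadata_handle):
    """Count identifiers in each file and how many appear in both."""
    in_fasta = set(fasta_ids(fasta_handle))
    in_meta = set(metadata_ids(metadata_handle))
    return {
        "fasta": len(in_fasta),
        "metadata": len(in_meta),
        "shared": len(in_fasta & in_meta),
    }
```

Calling `compare(open("sequences.fasta"), open("metadata.tsv"))` (file names are placeholders) and seeing `shared` equal to 0 would indicate that the identifiers in the two files are formatted differently.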
Sure. There are quite a few changes that affect the pipeline. Could you please have a look at it? I am using the latest script in the pipeline folder.
Ok I see the problem, the sequence identifiers are again formatted differently. It's an easy fix, I'll try to do it tomorrow.
Did you get a moment to fix it?
Many thanks
Just made some changes, could you pull the latest commit and give it a try?
I just downloaded the most recent GISAID data and the formatting hasn't changed. It seems the data you have shown above is actually older; it corresponds to the format I encountered ~9 months ago. So you could try the files in the manuscript folder for processing your data. However, I strongly recommend that you download the latest version of the full GISAID database and work with the scripts in pipeline.
The N-Content line also affects the select_sample script.
Yes, it uses this information. In the version from 9 months ago (see manuscript) we calculated N-content ourselves, but in the meantime it has become part of the GISAID metadata.
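For metadata that lacks this column, N-content could be computed directly from the sequences. The sketch below assumes N-content means the fraction of positions that are N; it is an illustration with made-up function names, not the code from the manuscript folder.

```python
def n_content(sequence):
    """Return the fraction of positions in `sequence` that are N (case-insensitive)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)

def n_content_per_record(lines):
    """Map each FASTA header (without '>') to the N-content of its sequence."""
    result = {}
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: store the previous one first.
            if header is not None:
                result[header] = n_content("".join(chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Store the final record.
    if header is not None:
        result[header] = n_content("".join(chunks))
    return result
```

`n_content_per_record` accepts any iterable of lines, so an open file handle works as input.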
Could you please download hcov_africa.fasta and hcov_africa.tsv and try running the scripts on them without changing anything? That is what I am using, and I am getting errors. I downloaded them from the region-specific Auspice source files.
There is data in the manuscript folder. What I have is from GISAID, and I think it is the latest since I downloaded it last week. I had narrowed it down to specific regions since I wanted just a small dataset to test the pipeline with my data first.
Ah so what you're using is not actually the full GISAID data, also not for Africa. These are Auspice files which are used for visualisation with Nextstrain (https://docs.nextstrain.org/projects/auspice/en/stable/) and they only have very few sequences. For building a good reference set you need the full GISAID database [GISAID -> EpiCoV -> Downloads -> Download packages -> FASTA (for the sequences) and metadata (for the metadata)]
If it helps, I can build an Africa-specific reference set and share the sequences / sequence identifiers.
Weirdly enough, I can't see the Download packages tab. GISAID -> EpiCoV -> Downloads takes me to a page where I cannot see the Download packages option with FASTA (for the sequences) and metadata (for the metadata). After Downloads, it shows tabs for Alignment and proteins, Submission and variant stats, and finally Genomic epidemiology. I have shared a screenshot to your email.
If it helps, I can build an Africa-specific reference set and share the sequences/sequence identifiers.
I will appreciate it if I can get this.
I have tried downloading a few sequences, around 100 per lineage, but I am getting 0 sequences found from the selection. Could you please help out with this?
Could you again post what your sequence identifiers look like in the fasta file?
I will build it. Can you send me an email at j.a.baaijens[at]tudelft.nl?
I have sent you an email. Please have a look at it.
Could you again post what your sequence identifiers look like in the fasta file?
hCoV-19/Reunion/HCL021109894801/2021|EPI_ISL_2676670|2021-06-08
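Note that the second field of this header is an accession ID (EPI_ISL_...), whereas the form asked about earlier has dates in both the second and third fields. A small helper like the one below could be used to see which layout a header follows. The function name and the date pattern are assumptions made for illustration; this is not how the pipeline scripts parse headers.

```python
import re

# Accepts YYYY, YYYY-MM or YYYY-MM-DD, since collection dates may be partial (assumption).
DATE = re.compile(r"^\d{4}(-\d{2}){0,2}$")

def describe_header(header):
    """Split a FASTA header on '|' and classify every field after the virus name."""
    fields = header.lstrip(">").strip().split("|")
    kinds = []
    for field in fields[1:]:
        if field.startswith("EPI_ISL_"):
            kinds.append("accession")
        elif DATE.match(field):
            kinds.append("date")
        else:
            kinds.append("other")
    return fields[0], kinds
```

For the header above this returns the virus name together with `["accession", "date"]`, while a header of the form <Virus name>|<Collection date>|<Submission date> would give `["date", "date"]`.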
I am facing the exact same issue. I couldn't get the desired details from GISAID, so I prepared the .tsv myself using sequence details and accession IDs, plus a FASTA file for those sequences. It gave me the above-mentioned error. Now I am trying to use the fasta and tsv for region Asia and country India from the GISAID download section. Will it help me to run the code if I rearrange the .tsv as given in the example?
Unfortunately the GISAID metadata headers have changed over time, so yes, it should be resolved by renaming columns in the metadata. You could also try VLQ-nf, a nextflow implementation of our pipeline: https://github.com/rki-mf1/vlq-nf
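Renaming the metadata columns could be done with a few lines of Python, sketched below. The function name is hypothetical and the mapping in the usage note uses placeholder column names; check which headers the pipeline scripts actually read before choosing the target names.

```python
import csv

def rename_columns(in_handle, out_handle, mapping):
    """Copy a tab-separated file, renaming header fields according to `mapping`.

    Columns not listed in `mapping` keep their original name; data rows are unchanged.
    """
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow([mapping.get(name, name) for name in header])
    writer.writerows(reader)
```

For example, with placeholder names: `rename_columns(open("in.tsv", newline=""), open("out.tsv", "w", newline=""), {"strain": "Virus name", "date": "Collection date"})`.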
Why am I getting the output below when I run the preprocessing script?
1323 sequences selected
searching fasta and writing sequences to output directory...
3679 sequences from input fasta processed
0 sequences from selection found