bioinf-mcb / gisaid-scrapper

Scrapping tool for GISAID data regarding SARS-CoV-2
MIT License
41 stars 16 forks

Casing in fasta header ID #2

Closed tolot27 closed 4 years ago

tolot27 commented 4 years ago

Currently, the FASTA header contains uppercase identifiers and the HCOV-19 prefix. The latter can be removed quite easily, but patching the header ID into the correct case is more complicated. Can you correct it when writing the file to disk?
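For illustration, removing the prefix could look like the minimal sketch below. The function name `strip_prefix` and the constant `PREFIX` are hypothetical and not part of the scraper; the sketch only assumes that the prefix is always `hCoV-19/` in some casing.

```python
# Hypothetical helper, not part of gisaid-scrapper.
PREFIX = "hCoV-19/"

def strip_prefix(header_id, prefix=PREFIX):
    """Drop the leading prefix regardless of its casing; leave other IDs untouched."""
    if header_id.upper().startswith(prefix.upper()):
        return header_id[len(prefix):]
    return header_id

stripped = strip_prefix("HCOV-19/WUHAN/IVDC-HB-01/2019|EPI_ISL_402119")
print(stripped)
```

This only handles the prefix. The rest of the ID stays uppercase, because the original casing cannot be derived from the uppercase string alone.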

A workaround is to change the augur filter dictionary lookup (https://github.com/nextstrain/augur/blob/904ed6d6c154753cd2dd2d210768f2e4df1e8bb6/augur/filter.py#L135) to be case-insensitive.
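A case-insensitive lookup of this kind might be sketched as follows. This is not augur's actual code; the names `build_case_insensitive_index` and `lookup` are made up for the example, and it assumes a list of correctly cased strain names (for instance from a metadata file) is available.

```python
# Hypothetical sketch of a case-insensitive lookup; not augur's implementation.
def build_case_insensitive_index(names):
    """Map each upper-cased name to its original, correctly cased form."""
    index = {}
    for name in names:
        # Keep the first spelling seen if two names differ only by case.
        index.setdefault(name.upper(), name)
    return index

def lookup(index, header_id):
    """Return the correctly cased name for header_id, or None if it is unknown."""
    return index.get(header_id.upper())

metadata_names = ["hCoV-19/Wuhan/IVDC-HB-01/2019"]
index = build_case_insensitive_index(metadata_names)
resolved = lookup(index, "HCOV-19/WUHAN/IVDC-HB-01/2019")
print(resolved)
```

The same index could also be used to restore the original casing when writing the FASTA file, provided the correctly cased names are known from another source.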

wwydmanski commented 4 years ago

So for example a HCOV-19/WUHAN/IVDC-HB-01/2019|EPI_ISL_402119 identifier should be hCoV-19/Wuhan/IVDC-HB-01/2019 instead?

tolot27 commented 4 years ago

So for example a HCOV-19/WUHAN/IVDC-HB-01/2019|EPI_ISL_402119 identifier should be hCoV-19/Wuhan/IVDC-HB-01/2019 instead?

It should be hCoV-19/Wuhan/IVDC-HB-01/2019|EPI_ISL_402119|2019-12-30. That is how it appears in the original fasta.
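As a rough sketch, the three pipe-separated fields of such a header could be split like this. The `FastaHeader` type and `parse_header` function are hypothetical, and the field meanings (name, accession, date) are inferred from the example above rather than from a GISAID specification.

```python
from typing import NamedTuple, Optional

class FastaHeader(NamedTuple):
    # Field names are assumptions based on the example header.
    name: str
    accession: Optional[str]
    date: Optional[str]

def parse_header(line):
    """Split a header line of the form name|accession|date into its fields."""
    fields = line.lstrip(">").strip().split("|")
    # Pad with None so that headers missing trailing fields still parse.
    fields += [None] * (3 - len(fields))
    return FastaHeader(*fields[:3])

full = parse_header(">hCoV-19/Wuhan/IVDC-HB-01/2019|EPI_ISL_402119|2019-12-30")
partial = parse_header("HCOV-19/WUHAN/IVDC-HB-01/2019|EPI_ISL_402119")
print(full)
print(partial)
```

With the fields separated, the missing date in the scraped header becomes visible as `None` instead of silently shifting the other fields.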

The Download button is currently back. Hence, no need to implement this right now.

wwydmanski commented 4 years ago

The Download button is currently back.

Not for me, unfortunately :/

tolot27 commented 4 years ago

The Download button is currently back.

Not for me, unfortunately :/

Did you contact GISAID support? Maybe they have a switch to activate it on a per-customer basis.