TRON-Bioinformatics / covigator

CoVigator - Monitoring SARS-CoV-2 mutations
MIT License

Host_tax_id field is NA for all queried ENA runs #130

Closed johausmann closed 6 months ago

johausmann commented 11 months ago

During the data update I noticed a serious problem in the ENA accessor. All samples were filtered out because host_tax_id was missing, even though the field is requested in the URL. The field appears to be NA for every sample we query. I also checked some SARS-CoV-2 ENA runs that had already been processed in a previous database update, and they show the same problem.

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&format=tsv&query=%22run_accession=DRR287659%22&fields=host_tax_id,host_scientific_name

However, the host_scientific_name field is returned. We could update the accessor module to filter by either host_tax_id or host_scientific_name.
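
To make the check above reproducible, here is a minimal sketch that rebuilds the query URL for a run accession and reports which of the requested fields come back empty. The helper names (`build_query_url`, `missing_fields`) are hypothetical and not part of the covigator accessor. The sketch assumes the TSV response carries a `run_accession` column, and it treats both an empty cell and the literal string NA as missing.

```python
import csv
import io
from urllib.parse import urlencode

ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"
HOST_FIELDS = ("host_tax_id", "host_scientific_name")


def build_query_url(run_accession, fields=HOST_FIELDS):
    """Build the ENA portal search URL for a single run (hypothetical helper)."""
    params = {
        "result": "read_run",
        "format": "tsv",
        "query": f'"run_accession={run_accession}"',
        "fields": ",".join(fields),
    }
    # Keep ',' and '=' unescaped so the URL matches the one quoted in this issue.
    return f"{ENA_SEARCH_URL}?{urlencode(params, safe=',=')}"


def missing_fields(tsv_text, fields=HOST_FIELDS):
    """Map each run accession to the requested fields that are empty or NA.

    Assumption: the response includes a 'run_accession' column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = {}
    for row in reader:
        empty = [f for f in fields if (row.get(f) or "").strip() in ("", "NA")]
        if empty:
            missing[row["run_accession"]] = empty
    return missing


if __name__ == "__main__":
    print(build_query_url("DRR287659"))
    # Illustrative response text, not real ENA output.
    sample = (
        "run_accession\thost_tax_id\thost_scientific_name\n"
        "DRR287659\t\tHomo sapiens\n"
    )
    print(missing_fields(sample))
```

The response body can be fetched with any HTTP client and passed to `missing_fields` as text; the sample string above only illustrates the shape of the data.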

johausmann commented 9 months ago

For the new data release, we skip the host_tax_id check and check host_scientific_name instead. We accept the values Homo sapiens and NA, since we assume that a large part of the NA samples come from human donors.
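
The workaround described above could look roughly like the following. This is only a sketch, not the actual accessor code: the function name `is_accepted_host` and the set of strings treated as missing are assumptions, and the comparison is made case-insensitive so that both "Homo sapiens" and "Homo Sapiens" pass.

```python
# Hypothetical constants; the real accessor may represent these differently.
ACCEPTED_HOST_NAMES = {"homo sapiens"}
MISSING_VALUES = {"", "na", "nan", "none"}


def is_accepted_host(host_scientific_name):
    """Return True if the run should be kept based on host_scientific_name.

    Missing values are accepted on the assumption that most NA samples
    come from human donors.
    """
    value = (host_scientific_name or "").strip().lower()
    return value in MISSING_VALUES or value in ACCEPTED_HOST_NAMES


if __name__ == "__main__":
    runs = [
        {"run_accession": "run_a", "host_scientific_name": "Homo sapiens"},
        {"run_accession": "run_b", "host_scientific_name": "NA"},
        {"run_accession": "run_c", "host_scientific_name": "Mus musculus"},
    ]
    kept = [r["run_accession"] for r in runs if is_accepted_host(r["host_scientific_name"])]
    print(kept)
```

Accepting NA trades precision for coverage: runs from non-human hosts with an unannotated host field would also pass this filter.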

priesgo commented 6 months ago

Fixed by ENA upstream