ImperialCollegeLondon / safedata_validator

Python tools to validate and publish datasets using the safedata metadata format.
https://safedata-validator.readthedocs.io/
MIT License

Parse and build taxon index from NCBI style taxon table without validation #175

Open davidorme opened 1 month ago

This is the first step in introducing a new high-trust approach for the outputs of sequencing and bioinformatics workflows. The approach simply requires a stated reference database and then builds a taxon table from the provided data.

This allows the taxon coverage to be reported and makes the taxon index searchable, but drops the impossible (and overly restrictive) target of validating workflow outputs against a specific taxon database state, which might not exist as a reference snapshot, from one of a large range of possible bioinformatics reference databases.

To begin with, this simply gets the taxon data read in and exported to the file metadata, which allows test file validation. It will need considerable refinement to add tests and possibly to retire the existing validated NCBI checking.
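
As a rough illustration of the idea (not the safedata_validator implementation), a taxon index could be built directly from the rank columns of a taxon table, trusting the supplied names and doing no lookup against a reference database. The column names, the `build_taxon_index` and `taxon_coverage` helpers, and the example rows below are all hypothetical and only assume one column per taxonomic rank.

```python
from __future__ import annotations

# Hypothetical rank columns, ordered from root to tip; the real table layout may differ.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def build_taxon_index(
    rows: list[dict[str, str]], ranks: tuple[str, ...] = RANKS
) -> list[dict]:
    """Build a flat, de-duplicated taxon index from rank columns.

    Names are taken on trust: nothing is checked against a taxonomy database.
    Each entry records its name, rank and the position of its parent entry.
    """
    index: list[dict] = []
    seen: dict[tuple, int] = {}  # (parent position, rank, name) -> position in index

    for row in rows:
        parent = None  # None marks a root entry
        for rank in ranks:
            name = (row.get(rank) or "").strip()
            if not name:
                continue  # tolerate missing ranks rather than rejecting the row
            key = (parent, rank, name)
            if key not in seen:
                seen[key] = len(index)
                index.append({"name": name, "rank": rank, "parent": parent})
            parent = seen[key]
    return index


def taxon_coverage(index: list[dict]) -> dict[str, int]:
    """Count the distinct taxa recorded at each rank."""
    counts: dict[str, int] = {}
    for entry in index:
        counts[entry["rank"]] = counts.get(entry["rank"], 0) + 1
    return counts


# Illustrative rows only, as might come from a workflow output table.
example_rows = [
    {"name": "otu_1", "kingdom": "Bacteria", "phylum": "Proteobacteria",
     "genus": "Escherichia", "species": "Escherichia coli"},
    {"name": "otu_2", "kingdom": "Bacteria", "phylum": "Proteobacteria",
     "genus": "Salmonella"},
]

example_index = build_taxon_index(example_rows)
example_coverage = taxon_coverage(example_index)
```

Keying each entry on its parent as well as its rank and name keeps identical names under different parents apart, which matters when nothing is being validated against a reference taxonomy.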