FabianDeister / Library_curation_BOLD


assessing sequence quality for non COI seq #25

Open bwprice opened 6 months ago

bwprice commented 6 months ago

Adding this comment here to revisit later. Our current approach (the sequence is in a BIN and is >500 bases long) only works for COI sequences. If we use a length criterion for non-COI genes, we need to make sure it makes sense.

rvosa commented 6 months ago

This probably generalises to the issue that the pipeline will need a configuration system. The general approach for Snakemake pipelines is to have a YAML file at config/config.yml where the various parameters are defined. This can include the sequence length threshold, the locations of the files the pipeline operates on, logging verbosity, etc.
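As a rough sketch of what that could look like: Snakemake loads the YAML file named in a `configfile:` directive into a plain dictionary called `config`, so a per-marker length threshold can be looked up like any other dict entry. The key names (`min_seq_length`, `default`, `input_tsv`, `log_level`), the file path, and the helper functions below are all hypothetical, and the only threshold taken from this thread is the >500 bases currently used for COI.

```python
# Stand-in for the dict Snakemake would build from config/config.yml.
# All key names and the input path are hypothetical placeholders.
config = {
    "min_seq_length": {
        "default": 500,   # fallback when a marker has no entry of its own
        "COI-5P": 500,    # current criterion from this thread: >500 bases
    },
    "input_tsv": "path/to/records.tsv",
    "log_level": "INFO",
}


def min_length_for(marker, cfg):
    """Return the length threshold for a marker, falling back to the default."""
    thresholds = cfg["min_seq_length"]
    return thresholds.get(marker, thresholds["default"])


def passes_length(seq, marker, cfg):
    """True if the ungapped sequence is strictly longer than the threshold."""
    ungapped = sum(1 for base in seq if base not in "-.")
    return ungapped > min_length_for(marker, cfg)


print(passes_length("A" * 600, "COI-5P", config))
```

Adding a criterion for another gene would then only mean adding an entry under `min_seq_length`, without touching the pipeline code.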

rvosa commented 6 months ago

Another issue, which we've avoided thinking about but which is definitely showing up in the data: there are records with long total sequences that span only a small part of the canonical COI-5P interval. My sense is that they probably originated in GenBank from longer mitogenome assemblies. If so, they'd still be filtered out on the basis of poor specimen metadata. But that basically makes the sequence quality metric just a nuisance that needs to be swamped by other metrics. It would be better to align each sequence against the COI HMM, trim it to the interval, and record the trimmed length as the sequence length.
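A minimal sketch of the "trim to the interval and record that as the length" step, assuming the alignment against the HMM has already been produced by some other tool and that each sequence is available as a row in model-column coordinates (one character per HMM match column, with `-` or `.` for gaps). The function name and the toy coordinates are hypothetical; the real interval would come from the COI-5P model.

```python
GAP_CHARS = "-."


def barcode_length(aligned_row, start, end):
    """Count non-gap residues in model columns [start, end) of an aligned row.

    aligned_row: sequence aligned to the HMM, one character per match column
                 (assumption: insert columns have already been removed).
    start, end:  0-based, half-open column interval of the barcode region.
    """
    if not 0 <= start <= end <= len(aligned_row):
        raise ValueError("interval must lie within the aligned row")
    window = aligned_row[start:end]
    return sum(1 for base in window if base not in GAP_CHARS)


# Toy example: a 20-column "model" with the region of interest at columns 2..11.
row = "----ACGT--ACGTAC----"
print(len(row.replace("-", "")))   # total ungapped length: 10
print(barcode_length(row, 2, 12))  # residues inside the interval: 6
```

With this measure, a long mitogenome-derived record that only clips the edge of the interval gets a short recorded length, regardless of how long the full sequence is.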

FabianDeister commented 6 months ago

I've already thought about this topic. It's probably something we should do separately for our reference library. It can't be scaled directly to everyone, but what you describe with the HMM is exactly what I have in mind. For that, we don't have to process all the data at once; we can do it in systematic groups or types.