Consensus COI reference sequences for major groups in BOLD

I'm interested in developing consensus COI reference sequences for major groups in BOLD:

I'm guessing I can parse the database informatically to get sequences - and then do alignments that generate a consensus per major group either at the command line or maybe using something like Geneious.

To provide balanced representation in generating a consensus sequence, given the available sequences per major group or phylum, I was thinking that within a major group I can assign all species to their families, randomly select one representative sequence per family, and then generate a phylum consensus from the aligned family sequences. However, when fewer than 10 sequences are available this way, I might add sequences within families, with one sequence per genus, to bring things to a minimum of 10 for the phylum or larger group. Or something like this...
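The sampling idea above can be sketched roughly as follows. This is only a minimal illustration, assuming each sequence record is already annotated with its family and genus; the function name `pick_representatives`, the record keys, and the `min_total` parameter are all hypothetical and not part of BOLD or MitoGeneExtractor.

```python
import random
from collections import defaultdict


def pick_representatives(records, min_total=10, seed=0):
    """Pick one random record per family; if that yields fewer than
    `min_total` records, top up with one record per not-yet-used genus.

    `records` is a list of dicts with the (hypothetical) keys
    'id', 'family' and 'genus'.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)

    # Step 1: one random representative per family.
    chosen = [rng.choice(by_family[fam]) for fam in sorted(by_family)]
    if len(chosen) >= min_total:
        return chosen

    # Step 2: top up with one sequence per genus that is not represented yet.
    used_genera = {rec["genus"] for rec in chosen}
    by_genus = defaultdict(list)
    for rec in records:
        if rec["genus"] not in used_genera:
            by_genus[rec["genus"]].append(rec)

    extra_genera = sorted(by_genus)
    rng.shuffle(extra_genera)
    for genus in extra_genera:
        if len(chosen) >= min_total:
            break
        chosen.append(rng.choice(by_genus[genus]))
    return chosen
```

If the group has fewer genera than `min_total`, the function simply returns one record per genus, so the result can still be shorter than the requested minimum.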

I plan to check all MGE-extracted COI sequences against BOLD for final identification. I thought to generate the different consensus sequences across biodiversity to ensure MGE extraction of each/all species (target and unknown contaminants, for example) cryptically present in a set of reads. Basically, the idea was to have many diverse reference sequences for MGE to work with, to help ensure identification of all species via COI diversity in a given Illumina dataset. But maybe so many specific consensus sequences are overkill for what is needed for general MGE extraction of all species' COIs, given that final identification will be against BOLD and not based on the reference used by MGE for extraction. With some of the marine samples we have, diverse cryptic contaminating species seem to be a common issue in the Illumina data for each target species we intended to sequence.

I was wondering if you have recommended guidelines or suggestions. One specific question I have is how many sequences were used to generate each of your consensus sequences? Also, I noticed you have sub-phylum level consensus sequences in arthropods and in chordates - would it be better to do things at the class rather than phylum level?

Thank you very much! Eric

Hi Eric, I'll try to address your questions sequentially:

1.) Yes, you can retrieve sequence information from BOLD and then align these sequences (command line, Geneious, AliView, ...) and create your consensus references. Since you need amino acid sequences for this step, BOLD might not be the best database: they provide protein sequences, but I don't know how easily you can access these bioinformatically. It might be more convenient to use a protein sequence database directly, such as UniProt or the NCBI protein database.

2.) Your approach of incorporating one sequence per family into alignments spanning a phylum sounds good in principle, but 10 sequences per phylum are not enough. If possible, include several hundred up to ~1000, depending on how well the major group/phylum is represented in your database. If you have fewer sequences available, you can include all of them or one per species.

3.) Assuming that you have one Illumina library of your target organism and suspect contaminating species due to e.g. parasitism (is that correct?), then the question is how closely related your target and contaminating species are. Let's say you have an arthropod target species and an arthropod parasite: then you can probably extract all COI reads with the same arthropod reference, but you cannot trust the single reconstructed COI gene sequence, because reads from both organisms were incorporated into it. Having more specific references could help to differentiate between target and contaminating organism, but again, this depends on their degree of relatedness, and you should check your output.

4.) I would need to look up precise numbers, but typically I include 300-1300 sequences, if available, for my taxon of interest. Generally, I would prefer the most specific references possible because (i) your reference is better (it is easier to find a good consensus for e.g. a class than for a phylum) and (ii) the reference is closer to the reads in your Illumina library. So yes, if possible, go for class-level references.
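As a rough illustration of "the most specific reference possible", here is a small sketch that walks up a lineage and returns the first rank with enough sequences for a consensus. The rank list, the function name `choose_reference_rank`, the `min_seqs` default of 300 (taken from the lower end of the range above), and the placeholder taxon names and counts in the usage note are all assumptions for illustration only.

```python
# Ranks ordered from most specific to least specific.
RANKS = ["genus", "family", "order", "class", "phylum"]


def choose_reference_rank(lineage, counts, min_seqs=300):
    """Return (rank, taxon) for the most specific rank with enough sequences.

    `lineage` maps rank -> taxon name for one target species.
    `counts` maps (rank, taxon name) -> number of available sequences.
    Falls back to the phylum if no rank reaches `min_seqs`.
    """
    for rank in RANKS:
        taxon = lineage.get(rank)
        if taxon is not None and counts.get((rank, taxon), 0) >= min_seqs:
            return rank, taxon
    # Worst case: use the phylum even if it is below the threshold.
    return "phylum", lineage.get("phylum")
```

For example, with a lineage like `{"genus": "GenusA", "family": "FamilyA", "class": "ClassA", "phylum": "PhylumA"}` and made-up counts of 12 for the genus, 80 for the family and 950 for the class, the function would return `("class", "ClassA")`.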

To give some further in-depth advice, I would need more information about the data set you want to analyze. How many Illumina libraries, what is the taxonomic range of the target organism, which kind of contamination is expected in which taxonomic range?

Hope this helps, let me know if you have further questions, Marie

Your advice is super helpful - thank you!

We have genome assemblies for over 150 eukaryotic species, mostly marine inverts, many of them multiple species within the same genus or family. I am not sure which sample belongs to which run or library, as the sequencing was outsourced, but I might be able to find out. Sequencing was often done with species of the same genus on the same run, and I am being told there is commonly 1-2% contamination between samples on the same run, both for work done with my samples and for others using the same company. We also found significant amounts of dog in one marine invert assembly, so I am assuming the artifacts could be diverse and could also be closely related, which sounds like it will be challenging. I will try to automate generation of consensus sequences for all BOLD and/or NCBI COI barcodes based on alignments of 300-1000 sequences (species) at the lowest phylogenetic level, down to genus, with phylum being the worst-case scenario. I can then combine the consensus sequences with the ones available here non-redundantly and use the consensus barcode gene set as the reference for MitoGeneExtractor to work with.

Or I'm thinking to try something like this...

Hi, I am not sure if I understood you correctly, but if you mean that you want to include the consensus sequences mined with MitoGeneExtractor in the reference generation, then this is not possible. References must be consensus sequences at the amino acid level, but the gene consensus sequences, i.e. the output of MitoGeneExtractor, are at the nucleotide level.
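To make the "consensus at the amino acid level" concrete, here is a minimal majority-rule sketch over protein sequences that are already aligned to equal length. It is only an assumption about how such a consensus could be computed, not necessarily how the references shipped with MitoGeneExtractor were built; the function name `majority_consensus` and its parameters are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5, unknown="X", gap="-"):
    """Majority-rule consensus of aligned amino acid sequences.

    Columns in which more than half of the sequences have a gap are dropped.
    Otherwise the most common residue is used if its share of the non-gap
    residues is strictly greater than `min_fraction`; if not, `unknown`
    is written at that position.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        if len(residues) * 2 < len(column):
            continue  # mostly gaps: leave this column out of the reference
        residue, count = Counter(residues).most_common(1)[0]
        if count / len(residues) > min_fraction:
            consensus.append(residue)
        else:
            consensus.append(unknown)
    return "".join(consensus)
```

For the toy alignment `["MKV-L", "MKI-L", "MRVAL"]` this gives `MKVL`: the fourth column is dropped because two of three sequences have a gap there.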

Anyway, your approach of working at the lowest taxonomic level (i.e., genus) is a good idea. Please note that if you have contamination from a congeneric species, it's very likely that you will extract these reads as well. If it is really only 1-2% contaminating reads, then your consensus sequence is probably fine, but check your data carefully, especially in the presence of ambiguous nucleotides (indicated as 'N' in the consensus).
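A quick way to flag consensus sequences that deserve a closer look is to count the ambiguous positions. The sketch below assumes the nucleotide consensus is available as a plain string; the function name `ambiguity_report` and the `max_n_fraction` threshold of 1% are arbitrary placeholders, not values recommended by MitoGeneExtractor.

```python
def ambiguity_report(consensus, max_n_fraction=0.01):
    """Summarise ambiguous positions ('N') in a nucleotide consensus.

    Returns a dict with the sequence length, the number, fraction and
    0-based positions of 'N', and a flag telling whether the fraction
    exceeds `max_n_fraction`.
    """
    seq = consensus.upper()
    n_positions = [i for i, base in enumerate(seq) if base == "N"]
    n_fraction = len(n_positions) / len(seq) if seq else 0.0
    return {
        "length": len(seq),
        "n_count": len(n_positions),
        "n_fraction": n_fraction,
        "n_positions": n_positions,
        "needs_review": n_fraction > max_n_fraction,
    }
```

A flagged sequence is not necessarily contaminated; the report only points to positions where the underlying reads disagree and the alignment should be inspected by eye.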

Let me know if you need further input, Best wishes, Marie

cmayer / MitoGeneExtractor
