RTRichar / MetaCurator

GNU General Public License v3.0

Losing taxa after curation #8

Closed: npechl closed this issue 5 months ago

npechl commented 5 months ago

Hi @RTRichar,

Thank you for developing MetaCurator. I am using MetaCurator to build a reference database on ITS1, and I have a couple of questions.

  1. First, I am unsure about how to select representative reference sequences. Is choosing 10 sequences at random from my input .fasta file an appropriate approach?
  2. Second, I have noticed that many species present in my input data are missing from the curated database (the same applies to the example you provide in the TestMetaCurator folder of the release version). How can I ensure that these species are accurately represented in the final curated sequences and taxonomy?

Thank you in advance for your time!

Nikos

RTRichar commented 5 months ago

Hi Nikos,

Thanks for your interest in using this software.

For the selection of reference sequences, the important thing is that they are trimmed precisely to the amplicon you are sequencing (including the primer region). Beyond that, it's theoretically ideal to use a set that is phylogenetically diverse so that the HMMs are optimally representative from the start. In practice, however, I've not seen large differences between different sets of input references.
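A minimal sketch of drawing a repeatable random subset from a FASTA file, for anyone who wants to compare several candidate reference sets. It is plain Python and not part of MetaCurator; the function names, the default of 10 records, and the seed are assumptions made for illustration. It only samples records: the chosen sequences still have to be trimmed to the amplicon (including the primer region) before they are used as references.

```python
import random


def read_fasta(lines):
    """Parse an iterable of FASTA text lines into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, n=10, seed=0):
    """Draw up to n records without replacement; a fixed seed repeats the draw."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def write_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Usage with a hypothetical file name:
# with open("input.fasta") as handle:
#     subset = sample_records(read_fasta(handle), n=10, seed=1)
# print(write_fasta(subset), end="")
```

Changing the seed gives a different candidate set, which makes it straightforward to check whether the choice of references changes the curated output for a given dataset.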

For question 2, this is not necessarily unexpected. Some input sequences are junk and, even for non-erroneous sequences, no curation method has 100 percent sensitivity for identifying the amplicon region of interest, especially for a more poorly conserved region like ITS1. One thing I'd recommend is to experiment with the -is, -cs and perhaps -e options.
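To see which species drop out under a given set of options, the input and curated taxonomies can be compared directly. The sketch below is plain Python and not part of MetaCurator. It assumes a hypothetical taxonomy layout of one record per line, with a sequence ID, a tab, and a semicolon-separated lineage whose last field is the species; adjust the separators if your files differ.

```python
def species_by_id(lines, sep="\t", rank_sep=";"):
    """Map sequence ID -> last lineage field (assumed here to be the species)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        seq_id, _, lineage = line.partition(sep)
        ranks = [r.strip() for r in lineage.split(rank_sep) if r.strip()]
        if ranks:
            mapping[seq_id.strip()] = ranks[-1]
    return mapping


def lost_species(input_map, curated_map):
    """Species present in the input taxonomy but absent from the curated one."""
    return sorted(set(input_map.values()) - set(curated_map.values()))


# Usage with hypothetical file names:
# with open("input.tax") as a, open("curated.tax") as b:
#     missing = lost_species(species_by_id(a), species_by_id(b))
# print(len(missing), "species lost")
```

Running the comparison after each change to the -is, -cs or -e settings shows whether a setting recovers species or only trades one set of losses for another.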

Best, Rodney

npechl commented 5 months ago

Thank you for your quick response. I'll keep experimenting with MetaCurator and come back to this if I have further questions.