bhattlab / MGEfinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.
MIT License

filtering used to create *.all_seqs.fna file? #33

Closed. tracylnicholson closed this issue 2 years ago

tracylnicholson commented 2 years ago

In our data set, some filtering seems to have occurred between the 01.clusterseq.*.tsv and *.all_seqs.fna files. Even after filtering the 01.clusterseq.*.tsv file to remove duplicates based on the sequence inference method, it contains far more MGE sequences than the *.all_seqs.fna file. Could you direct me to information on the filtering that was used to create the *.all_seqs.fna file? My main concern is that the 01.clusterseq.*.tsv file contains a high number of potentially false-positive MGE sequences. Thanks in advance for your help!

durrantmm commented 2 years ago

I think issue #30 should answer your question.

In summary, you shouldn't use the clusters in 01.clusterseq.tsv. It's really just an intermediate file, and a filtering step is applied afterwards to remove false positives.

As discussed in the paper, additional MGE filtering may be necessary, depending on your goal. If you aren't especially interested in de novo MGE discovery, you can keep only the MGEs that contain known transposases, recombinases, etc.
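For anyone who wants to try that kind of post-filtering, here is a rough sketch rather than anything MGEfinder does itself. It assumes you have already annotated the sequences in the *.all_seqs.fna file with a separate tool and collected the results into a dictionary mapping each sequence ID to a list of product descriptions. The function names, the keyword list, and the annotation format are all hypothetical and only meant to illustrate the idea.

```python
# Hypothetical keywords; extend to match the vocabulary of your annotation tool.
MOBILITY_KEYWORDS = ("transposase", "recombinase")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_mobility_gene(products, keywords=MOBILITY_KEYWORDS):
    """Return True if any product description mentions one of the keywords."""
    return any(k in p.lower() for p in products for k in keywords)


def filter_mges(records, annotations, keywords=MOBILITY_KEYWORDS):
    """Keep only records whose ID has at least one matching annotation.

    `annotations` is an assumed format: {sequence_id: [product, ...]}.
    The sequence ID is taken as the first whitespace-delimited token
    of the FASTA header.
    """
    kept = []
    for header, seq in records:
        seq_id = header.split()[0]
        if has_mobility_gene(annotations.get(seq_id, []), keywords):
            kept.append((header, seq))
    return kept


# Toy example with made-up sequences and annotations.
example_fasta = [">seq1", "ACGT", ">seq2", "GGCC"]
example_annotations = {
    "seq1": ["IS3 family transposase"],
    "seq2": ["hypothetical protein"],
}
kept = filter_mges(read_fasta(example_fasta), example_annotations)
print([header for header, _ in kept])  # prints ['seq1']
```

With real data you would pass an open file handle for the FASTA file to `read_fasta` and build `annotations` from your annotation tool's output. Plain keyword matching on product names is crude, so it is worth checking which sequences get dropped before relying on the result.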