BioinformaticsToolsmith / Identity


A sequence is too short #8

Open xiekunwhy opened 2 years ago

xiekunwhy commented 2 years ago

Hi,

I used MeShClust3 to cluster repeat sequences (including MITEs and TIRs) and got many warnings like "Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero." The sequence lengths in the input file range from 100 bp to ~1 Mb (sometimes from 20 bp to 3 Mb).

Will these warnings affect the results, and how can I avoid them?

Best, Kun

hani-girgis commented 2 years ago

Thanks for trying out MeShClust3.

First, I believe 20 bp is too short to be a MITE. I would recommend removing short sequences (perhaps those shorter than 50 bp). Second, I would recommend sorting the input sequences by length. Then I'd divide them into groups by length in bp (< 1000, 1000–5000, 5000–10000, etc.). The interval size does not need to be 5k bp; it can be 100k bp or longer. After that, I'd cluster each group separately.
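The preprocessing described above (drop short sequences, sort by length, split into length groups) could be sketched roughly as follows. This is only an illustration, not part of MeShClust3: the function names `read_fasta`, `bin_by_length`, and `write_groups`, the `group_N.fa` file naming, and the default bin edges are all assumptions chosen for the example. The 50 bp cutoff and the 1000/5000/10000 boundaries simply mirror the numbers suggested in the comment and should be adjusted to the data.

```python
from bisect import bisect_right
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def bin_by_length(records, min_len=50, edges=(1000, 5000, 10000)):
    """Drop records shorter than min_len and group the rest by length.

    Bin 0 holds lengths below edges[0]; bin i holds lengths in
    [edges[i-1], edges[i]); the last bin is open-ended.
    Each bin is sorted by sequence length.
    """
    bins = defaultdict(list)
    for header, seq in records:
        if len(seq) < min_len:
            continue  # hypothetical cutoff; tune for your repeats
        bins[bisect_right(edges, len(seq))].append((header, seq))
    for group in bins.values():
        group.sort(key=lambda rec: len(rec[1]))
    return dict(bins)


def write_groups(bins, prefix="group"):
    """Write each bin to its own FASTA file and return the file paths."""
    paths = []
    for index in sorted(bins):
        path = f"{prefix}_{index}.fa"  # assumed naming scheme
        with open(path, "w") as out:
            for header, seq in bins[index]:
                out.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths
```

For example, `with open("repeats.fa") as fh: write_groups(bin_by_length(read_fasta(fh)))` would produce one FASTA file per length group (here `repeats.fa` is a placeholder input name), and each resulting file could then be clustered on its own.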

Please keep me posted.

Best regards.