ML-Bioinfo-CEITEC / genomic_benchmarks

Benchmarks for classification of genomic sequences
Apache License 2.0
106 stars, 12 forks

masked DNA strings #32

Open kchu25 opened 1 year ago

kchu25 commented 1 year ago

Some DNA strings in the datasets consist partially or entirely of masked bases. For example, the 7th sequence in the DemoHumanOrWorm training set (checked via dset[6]) is a string of 'NNNNNNN....NNNN'. Maybe consider extracting the DNA strings from the unmasked genome?
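For anyone who wants to see how widespread this is, here is a minimal sketch of a scan for masked sequences. It assumes the sequences are already available as plain Python strings; `n_fraction`, `find_masked`, and the 0.5 threshold are hypothetical names and values chosen for illustration, not part of the genomic_benchmarks API.

```python
def n_fraction(seq: str) -> float:
    """Return the fraction of characters in seq that are 'N' (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def find_masked(seqs, threshold=0.5):
    """Return indices of sequences whose N fraction is at least `threshold`.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return [i for i, s in enumerate(seqs) if n_fraction(s) >= threshold]


# Toy data standing in for sequences pulled from a dataset.
seqs = ["ACGTACGT", "NNNNNNNN", "ACGTNNNN", "acgtnnnn"]
print(find_masked(seqs))  # [1, 2, 3]
```

Setting `threshold=1.0` would report only the sequences that are entirely N.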

simecek commented 1 year ago

I believe we use the unmasked genome, but I will look into that. It might still be that we hit the beginnings / ends of chromosomes, which are often unknown. Maybe we should check the randomly chosen sequences and remove those made up of long runs of Ns.
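One way such a check could look, as a rough sketch rather than what the package actually does: measure the longest run of N in each sampled sequence and drop the ones above a cutoff. The helper names and the `max_run=10` cutoff are assumptions made up for this example.

```python
import re


def longest_n_run(seq: str) -> int:
    """Return the length of the longest consecutive run of 'N' in seq."""
    runs = re.findall(r"N+", seq.upper())
    return max((len(r) for r in runs), default=0)


def drop_long_n_runs(seqs, max_run=10):
    """Keep only sequences whose longest N run is at most `max_run`.

    `max_run` is a hypothetical cutoff; a real value would need tuning.
    """
    return [s for s in seqs if longest_n_run(s) <= max_run]


# Toy stand-ins for randomly sampled genomic intervals.
sampled = ["ACGT" * 5, "ACGT" + "N" * 50 + "ACGT", "N" * 200, "ACNGT"]
kept = drop_long_n_runs(sampled, max_run=10)
print(len(kept))  # 2
```

Filtering on the longest run rather than on the overall N fraction keeps sequences with a few isolated ambiguous bases while still removing unassembled regions such as chromosome ends.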