OstfriesenBI / PredmiRNA

A set of scripts and tools to train a classifier for pre-miRNA Recognition

Feature calculation: Percentage low complexity regions detected in the sequence #13

Closed Finesim97 closed 5 years ago

Finesim97 commented 5 years ago

Program call and Python/R script

Input: FASTA file with the sequences
Output: CSV file with the sequence identifier and the fraction of the sequence that is low complexity

The program dustmasker should be used for this (score threshold for subwindows set to 15). It writes the low-complexity regions in lowercase in the FASTA file. The number of lowercase letters then has to be divided by the length of the sequence; this should be done with an R or Python function. dustmasker can be installed using conda. Example output:

"comment",low complex
"mmu-mir-380 MI0000797 Mus musculus miR-380 stem-loop", 0.2
"mmu-mir-381 MI0000798 Mus musculus miR-381 stem-loop", 0.4
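A minimal Python sketch of the counting step, assuming dustmasker has already produced a soft-masked FASTA file (as far as I know, something like `dustmasker -in input.fa -outfmt fasta -level 15 -out masked.fa`, but please check the flags against your installed version). The function names `read_fasta`, `low_complexity_fraction` and `write_low_complexity_csv` are made up for illustration and are not part of the repository.

```python
import csv


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def low_complexity_fraction(sequence):
    """Return the share of lowercase (masked) letters, 0.0 for empty input."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return masked / len(sequence)


def write_low_complexity_csv(fasta_handle, csv_handle):
    """Write one row per sequence: FASTA header and masked fraction."""
    # Assumption: column names follow the example in this issue.
    writer = csv.writer(
        csv_handle, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n"
    )
    writer.writerow(["comment", "low complex"])
    for header, sequence in read_fasta(fasta_handle):
        writer.writerow([header, low_complexity_fraction(sequence)])
```

Usage would be along the lines of opening `masked.fa` for reading and the target CSV for writing (with `newline=""`) and passing both handles to `write_low_complexity_csv`. The FASTA parsing is done by hand here only to keep the sketch dependency-free; Biopython's `SeqIO` would work just as well.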

Source Paper: HuntMi: an efficient and taxon-specific approach in pre-miRNA identification

Finesim97 commented 5 years ago

First focus on the frequencies of lower case letters in the csv sequence file!