edawson / rkmh

Classify sequencing reads using MinHash.
MIT License
48 stars 4 forks source link

Prevelance python script only useful for hpv16? #16

Open jolvany opened 3 years ago

jolvany commented 3 years ago

HiEric!

I was recently recommended your program, and I am trying to figure out the best method to utilized and understand this process for my end goal. So I am effectively trying to use sequencing reads to detect malaria in sub-Saharan Africa. In this region there are four possible species of parasite these reads could originate from- and what I'm trying to do, through the language of your program, is classify the reads using four references and either count the reads for each species or use your score_real_classification.py to get percentages.

Here is the command I used to classify since "classify" is currently unavailable:

rkmh stream -r PlasmoDB-51_PfalciparumCD01_Genome.fasta -r PlasmoDB-51_PvivaxP01_Genome.fasta -r PlasmoDB-
51_PmalariaeUG01_Genome.fasta -r PlasmoDB-51_PovalecurtisiGH01_Genome.fasta  -f NWD104090.fasta -s 1000 -k 10 > out.rk 

This did produce a viable out file (I can see output), but when I try to tabulate through your python scripts these are the errors I get:


python scripts/score_real_classification.py < out.rk > out.cls
scripts/score_real_classification.py:60: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if l_match is not "" and s_match is not "" and l_match is not s_match[0]:
scripts/score_real_classification.py:60: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if l_match is not "" and s_match is not "" and l_match is not s_match[0]:
scripts/score_real_classification.py:70: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if l_match is not "":
scripts/score_real_classification.py:73: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if s_match is not "":
Traceback (most recent call last):
  File "scripts/score_real_classification.py", line 21, in <module>
    read_len = int(tokens[2].strip().split("/")[1])
IndexError: list index out of range

And I am wondering if you could shed light on this error? Is it because it is designed to be for the hpv16 usage and I will have to write my own tabulation script? Or do I need to change something so it can recognize the available species in my input?

I can email you an example of the out.rk file but even when I make it .txt it won't let me attach it here for some reason. But here is the species breakdown: Anything with Pv is Plasmodium vivax Anything with Pf Plasmodium falciparum Anything with Pm Plasmodium malariae Anything with Po Plasmodium ovale

And finally if your answer is that I will need to write my own script, is the out.rk file supposed to be understood as: [Reference it mapped to] [Read] [kmers mapped] [sketch size]?

Thank you in advance!!