PoonLab / OpenRDP

An open-source re-implementation of the RDP4 recombination detection program
GNU General Public License v3.0

ValueError thrown when running GENECONV #73

Closed ArtPoon closed 5 months ago

ArtPoon commented 5 months ago

A user reported problems running geneconv on their data, with the following exception thrown:

Loading configuration from /miniconda3/envs/open/lib/python3.8/site-packages/OpenRDP-0.1.0-py3.8.egg/openrdp/default.ini
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Traceback (most recent call last):
  File "/miniconda3/envs/open/bin/openrdp", line 4, in <module>
    __import__('pkg_resources').run_script('OpenRDP==0.1.0', 'openrdp')
  File "/miniconda3/envs/open/lib/python3.8/site-packages/pkg_resources/__init__.py", line 722, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/miniconda3/envs/open/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1561, in run_script
    exec(code, namespace, namespace)
  File "/miniconda3/envs/open/lib/python3.8/site-packages/OpenRDP-0.1.0-py3.8.egg/EGG-INFO/scripts/openrdp", line 44, in <module>
    results = scanner.run_scans(args.infile, args.ref)
  File "/miniconda3/envs/open/lib/python3.8/site-packages/OpenRDP-0.1.0-py3.8.egg/openrdp/__init__.py", line 230, in run_scans
    results.dict['geneconv'] = geneconv.execute(infile)
  File "/miniconda3/envs/open/lib/python3.8/site-packages/OpenRDP-0.1.0-py3.8.egg/openrdp/geneconv.py", line 146, in execute
    gc_results = self.parse_output(out_path)
  File "/miniconda3/envs/open/lib/python3.8/site-packages/OpenRDP-0.1.0-py3.8.egg/openrdp/geneconv.py", line 173, in parse_output
    locations = (int(line[4]), int(line[5]))  # Locations in alignment
ValueError: invalid literal for int() with base 10: '1.0'

ArtPoon commented 5 months ago

A trivial fix would be to cast the items from line to float before calling int, but we should probably first determine why the program is returning 1.0.
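
A minimal sketch of that cast, for reference. The helper name to_int is hypothetical and not part of OpenRDP; it only shows that going through float first accepts a string such as '1.0', which int() alone rejects:

```python
def to_int(token):
    """Convert a numeric string such as '1' or '1.0' to int via float.

    Hypothetical helper for illustration only, not OpenRDP code.
    """
    return int(float(token))


print(to_int('222'))  # 222
print(to_int('1.0'))  # 1
```

Note that int(float(...)) truncates silently, so '1.9' would become 1. That is one reason to find out where the 1.0 comes from before adopting the cast.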

WilliamZekaiWang commented 5 months ago

Think I found the issue.

This is a segment of the output from the .frags file generated by the geneconv script.

#   SeBC Sim  BC KA    Aligned Offsets   Num  Num  Tot  MisM
#   NaPvalue  Pvalue   Begin  End   Len  Poly Dif  Difs Pen.
AI  Cc_3>ERS990491;Cc_1_828>ERS990617 0.0465  0.84130     1    222  222    52   2   51    4
AI  Cc_2>ERS990555;Cc_1_1150>ERS990330 0.0465  > 1.0      29    697  669   141   2   12   15

The error comes from the way we split each row of text. Normally the 4th column is a float, but in the second data row it is > 1.0 instead. Splitting on whitespace separates > and 1.0 into two list items when they should be one. This pushes 1.0 to index 4, which should hold an int, causing the error.
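
The shift can be reproduced with the two rows quoted above. This is only a sketch that assumes the parser calls split() with no arguments; the variable names are made up for illustration:

```python
# The two data rows from the .frags excerpt, copied verbatim.
ok_row = "AI  Cc_3>ERS990491;Cc_1_828>ERS990617 0.0465  0.84130     1    222  222    52   2   51    4"
bad_row = "AI  Cc_2>ERS990555;Cc_1_1150>ERS990330 0.0465  > 1.0      29    697  669   141   2   12   15"

ok = ok_row.split()    # split on any run of whitespace
bad = bad_row.split()

print(len(ok), len(bad))  # 11 12
print(ok[3:6])            # ['0.84130', '1', '222']
print(bad[3:6])           # ['>', '1.0', '29']

# Index 4 is read as the start location; in the second row it holds '1.0'.
try:
    int(bad[4])
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '1.0'
```

The second row yields one extra item, so every field after the p-value sits one index too far to the right.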

In this case, should I just make it so we ignore the >?

ArtPoon commented 5 months ago

https://github.com/PoonLab/OpenRDP/blob/30b94f8d00527d8f00e94b2574641101d5f88630/openrdp/geneconv.py#L162-L174

We should be able to split on tabs instead of general whitespace (which is what happens when we call split() with no arguments), which should keep the > and 1.0 together. Then we need to add a check for non-numerical characters and strip them out, i.e., drop the >.

WilliamZekaiWang commented 5 months ago

I was wrong about the file being tab separated. Splitting by tabs didn't separate the string. I did the following and it allowed the geneconv analysis to run:

line[2:] = [item for item in line[2:] if all(char.isalnum() or char == '.' for char in item)]

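where the first two items in line are text (the row label and the sequence names) rather than numbers, so the filter is applied only from index 2 onward.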
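
For comparison, a minimal sketch of a narrower filter that removes only a stand-alone > or < item. The function name split_frags_row is hypothetical and not part of OpenRDP, and the row is the one quoted earlier in this thread. The one-liner above drops any item containing a character other than a letter, a digit, or '.', so it would also discard values such as -1 or 1e-05 if they ever appeared; this version keeps them:

```python
def split_frags_row(row):
    """Split a .frags data row on whitespace and drop lone '>' or '<' items.

    Hypothetical helper for illustration only, not OpenRDP code.
    """
    return [item for item in row.split() if item not in ('>', '<')]


bad_row = "AI  Cc_2>ERS990555;Cc_1_1150>ERS990330 0.0465  > 1.0      29    697  669   141   2   12   15"
line = split_frags_row(bad_row)

print(line[3])                          # 1.0
locations = (int(line[4]), int(line[5]))
print(locations)                        # (29, 697)
```

The > inside the sequence names is untouched because only items that consist of the sign alone are removed.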

ArtPoon commented 5 months ago

@WilliamZekaiWang to push fix to dev branch for fast review before PR