RasmussenLab / phamb

Downstream processing of VAMB binning for Viral Elucidation
MIT License

Parsing deepvirfinder line 512, in _parse_dvf_row contig_name, length, score, pvalue = line[:-1].split() #43

Closed TomasaSbaffi closed 1 year ago

TomasaSbaffi commented 2 years ago

Hello,

I am really happy to be trying the PHAMB pipeline on my data. I am running it on small co-assemblies: I do not have a concatenated assembly, so I am running the pipeline separately for each co-assembly. Is this the wrong approach?

When I run the RF model, Python raises the following error:

Parsing deepvirfinder
Traceback (most recent call last):
  ...
  File "path/to/phamb/workflows/mag_annotation/scripts/run_RF_modules.py", line 512, in _parse_dvf_row
    contig_name, length, score, pvalue = line[:-1].split()
ValueError: too many values to unpack (expected 4)

The head of my clusters.tsv:

1   k141_169383 flag=1 multi=4.0000 len=2138
2   k141_566141 flag=1 multi=5.0000 len=1337
3   k141_562874 flag=1 multi=3.0000 len=2128
4   k141_174278 flag=1 multi=3.0000 len=1243
5   k141_155879 flag=1 multi=4.0000 len=1035
6   k141_981516 flag=0 multi=7.5058 len=1355
7   k141_615867 flag=1 multi=3.0000 len=1068
8   k141_749989 flag=1 multi=4.0000 len=1960
9   k141_945068 flag=0 multi=15.6210 len=2455
10  k141_1091919 flag=0 multi=5.9626 len=1318

The head of my all.DVF.predictions.txt:

name    len score   pvalue
k141_344865 flag=1 multi=4.0000 len=1127    1127    6.64381843762385e-07    0.8834881788654733
k141_620757 flag=0 multi=3.7828 len=1260    1260    0.061418987810611725    0.2213724601556009
k141_298883 flag=1 multi=3.0000 len=1290    1290    0.013160040602087975    0.3235138605634867
k141_390848 flag=1 multi=2.0790 len=1179    1179    0.6529936790466309  0.036823022886924996
k141_206919 flag=0 multi=10.9103 len=1479   1479    1.0 0.0
k141_505802 flag=1 multi=25.0000 len=1881   1881    0.08912927657365799 0.196616058614699
k141_1057576 flag=1 multi=3.0000 len=1049   1049    0.635226845741272   0.038635848629050534
k141_896644 flag=0 multi=200.6066 len=1872  1872    0.9405460357666016  0.01478585995921142
k141_1034585 flag=0 multi=3.0000 len=1245   1245    0.9999510645866394  0.0011518996903089357

Is it due to the 4 columns composing the name of the contigs? Any suggestions?

Thanks again for the great pipeline!

joacjo commented 1 year ago

Hi @TomasaSbaffi

Thanks for trying out Phamb! If you ran Vamb separately for each co-assembly, it makes sense to run Phamb separately for each co-assembly as well.

Now to your problem: it is the naming of your contigs that produces the error, specifically the spaces in the FASTA headers. I would recommend renaming your contigs and replacing the spaces with "_", not only to make this parsing script work but also because many other bioinformatics tools do not handle spaces in FASTA headers properly.
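To see why the spaces matter, here is a minimal sketch that reproduces the failure outside of Phamb. It reuses the `line[:-1].split()` call from the traceback on one row copied from the predictions file above. The helper name `split_like_parser` is made up for illustration, and the row is assumed to be tab-separated between its four columns.

```python
def split_like_parser(line):
    """Mimic the call from the traceback: drop the newline, split on any whitespace."""
    return line[:-1].split()


# One row from all.DVF.predictions.txt as posted above
# (assumption: the four columns are separated by tabs).
row = (
    "k141_344865 flag=1 multi=4.0000 len=1127"
    "\t1127\t6.64381843762385e-07\t0.8834881788654733\n"
)

fields = split_like_parser(row)
print(len(fields))  # the contig name alone contributes four fields

try:
    contig_name, length, score, pvalue = fields
except ValueError as err:
    print(err)
```

Because `split()` without an argument splits on spaces as well as tabs, the spaces inside the contig name are treated as column separators, so the row yields more than the four expected values.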

The name change should look like this: k141_1091919 flag=0 multi=5.9626 len=1318 -> k141_1091919_flag=0_multi=5.9626_len=1318
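One way to do this renaming is sketched below in plain Python. This is not part of Phamb; the function names `sanitize_header` and `sanitize_fasta` and the file paths are hypothetical, and the sketch assumes a standard FASTA file in which header lines start with ">".

```python
def sanitize_header(line):
    """Replace runs of whitespace in a FASTA header line with underscores.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    return ">" + "_".join(line[1:].split()) + "\n"


def sanitize_fasta(in_path, out_path):
    """Copy a FASTA file, rewriting only its header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(sanitize_header(line))


# Example call (hypothetical file names):
# sanitize_fasta("contigs.fna", "contigs.renamed.fna")
```

Files that were produced from the original headers, such as clusters.tsv and all.DVF.predictions.txt, would presumably need to be regenerated or renamed in the same way so that the contig names still match across files.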

Best, Joachim

TomasaSbaffi commented 1 year ago

Thank you very very much!!