DRL / kinfin

Taxon-aware analysis of clustered protein sequences
https://kinfin.readme.io
GNU General Public License v3.0

get_count_matrix error #29

Closed evolgenomology closed 6 years ago

evolgenomology commented 6 years ago

Hi Dom

I got an error with the get_count_matrix script. It has a problem with a protein ID in a dictionary it uses. Could you maybe point me to why this might be? (I think there are many protein IDs like this in the clustering, and the general kinfin run completes without problems on these data.)

get_count_matrix.py -g ./Orthogroups.txt -c ./config.orig.modi.taxid.txt -s ./SequenceIDs.orig.no145.txt -o ./I3.5_kinfinfin_with_tree1_prots_ipr_wntcount -t ./DME.wnts.IDs -f /SAN/telfordlab/xenogenomics/kinfin/I_3.5_sp25_corrected_specs_names/renamed/all_specs.functional_annotation.txt
[+] Start ...
[+] Parsing ./config.orig.modi.taxid.txt ...
[+] Parsing ./SequenceIDs.orig.no145.txt ...
[+] Parsing ./DME.wnts.IDs ...
[+] Parsing ./Orthogroups.txt ...
Traceback (most recent call last):
  File "../scripts/get_count_matrix.py", line 317, in <module>
    dataCollection = DataCollection(args)
  File "../scripts/get_count_matrix.py", line 135, in __init__
    self.parse_groups_f()
  File "../scripts/get_count_matrix.py", line 294, in parse_groups_f
    taxon_counter = Counter([self.taxon_id_by_protein_id[protein_id] for protein_id in clusterObj.protein_ids])
KeyError: 'EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344'

Cheers

Phil

DRL commented 6 years ago

Hi Phil, it's not that it doesn't like the ID ... It can't find the ID in the sequence ID file.

I would assume the ID is slightly different in the sequence ID file, and therefore the script doesn't know which proteome/taxon the protein originated from. Try the following: does it produce any output?

grep 'EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344' ./SequenceIDs.orig.no145.txt
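
To check every ID at once rather than grepping one at a time, here is a minimal Python sketch. It is not part of kinfin; the function names are hypothetical, and it assumes that each SequenceIDs line looks like `index: header ...` and each Orthogroups line like `OG0000001: id1 id2 ...`, with the protein ID taken as the header up to the first whitespace.

```python
def sequence_ids(lines):
    """Collect protein IDs from SequenceIDs-style lines ('index: header ...')."""
    ids = set()
    for line in lines:
        # Drop the 'index: ' prefix, then keep the header up to the first whitespace.
        _, _, header = line.strip().partition(": ")
        if header:
            ids.add(header.split()[0])
    return ids


def missing_ids(group_lines, known_ids):
    """Return IDs that occur in Orthogroups-style lines but not in known_ids."""
    missing = []
    for line in group_lines:
        # Everything after 'OGxxxxxxx: ' is a whitespace-separated list of IDs.
        _, _, members = line.strip().partition(": ")
        missing.extend(pid for pid in members.split() if pid not in known_ids)
    return missing


# Hypothetical usage with the files from the command above:
# with open("SequenceIDs.orig.no145.txt") as seq_f, open("Orthogroups.txt") as og_f:
#     for pid in missing_ids(og_f, sequence_ids(seq_f)):
#         print(pid)
```

Any ID this prints would raise the same KeyError as above.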

cheers,

Dom

evolgenomology commented 6 years ago

Hi Dom,

thanks. I guess it is OrthoFinder changing “:” to “_” then, right?


43_388: EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344 TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344  ORF type:internal len:167 (+) TRINITY_DN1388_c0_g2_i1:2-499(+)

What do you suggest? Should I change the IDs, or do you want to modify the script?
Also, why is kinfin itself fine with this? I guess because it is just counting, not searching, right?

Cheers

P

evolgenomology commented 6 years ago

PS Just seeing there is more to it. The dictionary is only created using the identifier until the first space, so these long ones don’t work at all. Uh, oh, what to do … ???
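
A minimal sketch of that lookup, using the SequenceIDs line and the ID from the traceback quoted in this thread. The helper `lookup_key` is hypothetical (it is not the code in get_count_matrix.py), and it assumes the key is the header up to the first space with `:` replaced by `_`; that is the only replacement confirmed by the data here.

```python
def lookup_key(sequence_ids_line, replacements=((":", "_"),)):
    """Derive a dictionary key from one SequenceIDs line.

    Assumption: the key is the header up to the first whitespace, with the
    given character replacements applied afterwards.
    """
    _, _, header = sequence_ids_line.strip().partition(": ")
    key = header.split()[0]
    for old, new in replacements:
        key = key.replace(old, new)
    return key


line = ("43_388: EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344 "
        "TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344  ORF type:internal "
        "len:167 (+) TRINITY_DN1388_c0_g2_i1:2-499(+)")
wanted = "EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344"

print(lookup_key(line, replacements=()) == wanted)  # False: the raw key still contains '::'
print(lookup_key(line) == wanted)                   # True: matches after ':' -> '_'
```

For this one example, cutting at the first space is not what breaks the match; the colons are. Other headers may differ in further characters.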

DRL commented 6 years ago

In the main Kinfin script I have tried to emulate the behaviour of OrthoFinder, by doing the same replacements they do ...

But in general it is easiest to sanitise the headers into something that has only alphanumeric characters and dots.
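
A minimal sketch of such a sanitising step, presumably best applied to the FASTA headers before clustering so that every downstream file carries the same IDs. The function names are hypothetical, and collapsing every run of other characters into a single dot is just one possible convention, so the sketch also refuses to continue if two headers would end up with the same ID.

```python
import re


def sanitise_id(header):
    """Reduce a FASTA header to its first token, keeping only [A-Za-z0-9.]."""
    token = header.lstrip(">").split()[0]
    # Collapse every run of other characters into a single dot.
    return re.sub(r"[^A-Za-z0-9.]+", ".", token).strip(".")


def sanitise_all(headers):
    """Map each original header to its sanitised ID, refusing collisions."""
    mapping = {}
    seen = {}  # sanitised ID -> header that produced it
    for header in headers:
        new_id = sanitise_id(header)
        if new_id in seen and seen[new_id] != header:
            raise ValueError(f"{header!r} and {seen[new_id]!r} both become {new_id!r}")
        seen[new_id] = header
        mapping[header] = new_id
    return mapping


print(sanitise_id(">EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344 len:167"))
# EUMMAC.TRINITY.DN1388.c0.g2.TRINITY.DN1388.c0.g2.i1.g.1344.m.1344
```

Keeping the returned mapping around makes it possible to translate results back to the original headers later.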