evolgenomology closed this issue 6 years ago
Hi Phil, it's not that it doesn't like the ID ... It can't find the ID in the sequence ID file.
I would assume the ID is slightly different in the sequence ID file, and therefore the script doesn't know which proteome/taxon the protein originated from. Check the following: does it produce any output?
grep 'EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344' ./SequenceIDs.orig.no145.txt
Cheers,
Dom
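The same check can be sketched in Python for cases where the literal search finds nothing. This is only an illustration: `find_id` and `_squash` are hypothetical helper names, not part of KinFin, and the "near match" rule (collapsing every run of punctuation into one underscore) is an assumption chosen to expose IDs that differ only in their separator characters.

```python
import re

def _squash(text):
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", text)

def find_id(query, lines):
    """Return (exact, near).

    exact: lines that contain the query literally.
    near:  lines that match only after punctuation has been squashed.
    """
    exact = [l.rstrip("\n") for l in lines if query in l]
    squashed = _squash(query)
    near = [
        l.rstrip("\n")
        for l in lines
        if query not in l and squashed in _squash(l)
    ]
    return exact, near

if __name__ == "__main__":
    query = "EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344"
    with open("./SequenceIDs.orig.no145.txt") as handle:
        exact, near = find_id(query, handle)
    print("exact matches:", exact)
    print("near matches:", near)
```

An empty `exact` list together with a non-empty `near` list would suggest that the ID is present but written with different separators.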
Hi Dom,
thanks. I guess it is OrthoFinder changing “:” to “_” then, right?
```
43_388: EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344 TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344 ORF type:internal len:167 (+) TRINITY_DN1388_c0_g2_i1:2-499(+)
```
What do you suggest? Should I change the IDs, or would you rather modify the
script?
Also, why is kinfin itself fine with this? I guess because it is just
counting, not searching, right?
Cheers
P
On 2 May 2018 at 13:03:53, Dominik R Laetsch (notifications@github.com) wrote:
> grep 'EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344' ./SequenceIDs.orig.no145.txt
PS Just seeing there is more to it. The dictionary is only created using the identifier until the first space, so these long ones don’t work at all. Uh, oh, what to do … ???
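To make the point about the first space concrete, here is a minimal sketch of a lookup built that way. It is not the actual code of the script: `build_id_lookup` is a hypothetical name, and the `<id>: <header>` line layout is assumed from the sequence ID line quoted in this thread.

```python
def build_id_lookup(lines):
    """Map the header text up to the first whitespace to its numeric ID.

    Assumes each line looks like '<id>: <header>', as in the line
    quoted in this thread.
    """
    lookup = {}
    for line in lines:
        of_id, sep, header = line.strip().partition(": ")
        tokens = header.split()
        if sep and tokens:
            # Only the first whitespace-delimited token becomes the key.
            lookup[tokens[0]] = of_id
    return lookup

line = (
    "43_388: EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344 "
    "TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344 ORF type:internal "
    "len:167 (+) TRINITY_DN1388_c0_g2_i1:2-499(+)"
)
lookup = build_id_lookup([line])

colon_id = "EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344"
underscore_id = "EUMMAC|TRINITY_DN1388_c0_g2__TRINITY_DN1388_c0_g2_i1__g.1344__m.1344"

print(colon_id in lookup)       # True: the key keeps the original "::"
print(underscore_id in lookup)  # False: the "__" form is never a key
```

In this sketch the key is the header up to the first space with its original `::` separators, so a query using the `__` form finds nothing.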
In the main KinFin script I have tried to emulate the behaviour of OrthoFinder by doing the same replacements it does ...
In general, though, it is easiest to sanitise the headers so that they contain only alphanumeric characters and dots.
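One way to do that sanitising before clustering is sketched below. The helper names (`sanitise_id`, `sanitise_all`), the choice of a dot as the replacement character, and the optional `keep` argument are all assumptions made for illustration; this does not reproduce the replacement rules used by OrthoFinder or KinFin.

```python
import re

def sanitise_id(header, keep=""):
    """Return the first whitespace-delimited token of a header with every
    character outside [A-Za-z0-9.] (plus any listed in `keep`) replaced
    by a dot. A leading '>' from a FASTA header is dropped."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return ""
    allowed = "A-Za-z0-9." + re.escape(keep)
    return re.sub("[^" + allowed + "]", ".", tokens[0])

def sanitise_all(headers, keep=""):
    """Map each original header to its sanitised ID, refusing to continue
    if two different headers collapse to the same ID."""
    mapping = {}
    seen = {}
    for header in headers:
        new_id = sanitise_id(header, keep)
        if new_id in seen and seen[new_id] != header:
            raise ValueError(
                "ID collision: %r and %r both become %r"
                % (seen[new_id], header, new_id)
            )
        seen[new_id] = header
        mapping[header] = new_id
    return mapping

header = (
    ">EUMMAC|TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344::m.1344 "
    "TRINITY_DN1388_c0_g2::TRINITY_DN1388_c0_g2_i1::g.1344 ORF type:internal"
)
print(sanitise_id(header))
print(sanitise_id(header, keep="|_"))
```

With the default settings the `|` after the taxon prefix is replaced as well; if that delimiter (or the underscores) needs to survive, passing it in `keep` leaves it untouched. The collision check is there because stripping punctuation can make two distinct headers identical.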
Hi Dom,
I got an error with the get_count_matrix script. It has a problem with a protein ID in a dictionary it uses. Could you maybe point me to why this might be? (I think there are many protein IDs like this in the clustering, and the general kinfin runs without problems with these data.)
Cheers,
Phil