BreakerLab / dimpl

DIMPL: Discovery of Intergenic Motifs PipeLine
MIT License
3 stars 3 forks source link

fixcoords job runs until time expires #17

Closed ggaffield closed 3 years ago

ggaffield commented 3 years ago

The fixcoords job has two parts: fix_sto_coords.sh fix_tblout_coords.sh

fix_sto_coords.sh is the one that appears to be getting stuck.

ggaffield commented 3 years ago

Actually, fix_tblout_coords.sh runs first and runs fine. fix_sto_coords.sh is having trouble processing the output of the {name}.cm.align.sto file, which created by cmsearch. cmsearch is detecting duplicate accession numbers during its search so it makes them unique by prepending "####|" to the accession number. fix_sto_coords.sh crashes because it cannot handle the "|" character in the accession number. This should be fixed. But, it does bring up the question as to why there are duplicate accession numbers in the first place. I suspect we may have a duplicate IGR in the dataset cmsearch is searching through.

ggaffield commented 3 years ago

The env12 dataset (refseq98env12.s50.igr.fasta) does have duplicates/overlaps. 19,556 to be specific. And this takes the 50nt slop factor into account. So overlaps in the 50nt region on either end aren't counted. Whether duplicates and overlaps should be in the dataset can be debated. However, I feel that the code should handle the situation more robustly, regardless of the dataset's contents.

ggaffield commented 3 years ago

Upon further anaylsis, I found that the env12 dataset does have some duplicate accnos. 52,737 to be exact. The duplicates came from a single fasta file (2013338003.fna) that was imported once into env6, and then again into env7. This never caused a problem for multiravenna for some reason. I suspect it's because of the way multiravenna performs an Infernal search. It breaks the single large search job up into several smaller ones. Each job only searches a single sub-fasta file of the larger dataset. So each individual Infernal searches on env12 didn't run into this duplicate accno problem, because it never searched env6/2013338003.fna and env7/2013338003.fna at the same time.

Solution: I removed the 52,737 entries from the "refseq98env12.s50.igr.fasta" file that came from env6. This does not require an changes to the code base. The change was made on the Farnam cluster where the file resides.

However, it might be a good idea to modify the code to handle this case in the future, should someone try to use it on a custom database that just happens to have duplicates.

kenibrewer commented 3 years ago

Yeah. I think correcting the problems with the fix_coords scripts is a good idea. Do those scripts need to be re-written in python or will it be possible for a shell script to handle the "|" character?

ggaffield commented 3 years ago

Yes, I fixed the fix_sto_coords script to handle the "|" character. I'm doing a system test now to see how this affects things downstream.