Dill-PICL / GOMAP-singularity

GOMAP-Singularity is the containerized version of GOMAP
http://gomap.blunderingbioinformatics.org/
MIT License

Domain Step Fails (Error in setnames(x, value): Can't assign 15 names to a 0 column data.table) #13

Closed Thyra closed 5 years ago

Thyra commented 5 years ago

Describe the bug: The domain step fails on some legumes after about 15 minutes.

Input File

>phavu.G19833.gnm2.ann1.Phvul.L001612.1 pacid=37142049 transcript=Phvul.L001612.1 locus=Phvul.L001612 ID=Phvul.L001612.1.v2.1 annot-version=v2.1
MLETAEEPEFLMADMSPEQLSSFAAYKAKLNAIRQSEKEKSIEKALKDAGLGHREVTPLMKLRVVGLTYKTRQDKPKEGI
VTIWNPIEKQLLELVEGGAYAVAGLMPSSSDFDILQLHARGSCTKWLPLSSNAREQFRPFFRRRKSTPLSSLGDIPLSNE
FDIAAYVVHVGRVYTSNQQKKQWVFVTDGSIMNGLQSEKLINSLLAICFCSPLIDHDSSFPLFNYNLAGSTVGLCNLIKK
EKDHTNHIWVADANENSAYYLNFDSSNCSHLRNAASSIRRWAYNSLLIIEKLKEKVLHVVGDDCKA
>phavu.G19833.gnm2.ann1.Phvul.010G034400.1 pacid=37142050 transcript=Phvul.010G034400.1 locus=Phvul.010G034400 ID=Phvul.010G034400.1.v2.1 annot-version=v2.1
MGGGGGEEGNNLEFTPTWVVAVVCSVIVSASFAAERFLHYGGTFLKKKNQKPLFEALLKIKEELMLLGFISLLLTVTQNG
IIKICVPESWTRHMLPCSLKDKEELESAKLTSHFQTFFSFTDIPATVRHLLAENENEDHQSGEKLGHCAKKGRVPLLSVE
ALHHLHIFIFVLAIVHVTFCVLTVVFGGLKIRQWKHWENSIVDENNRKQPVLESIVTHVHEHAFIQNHFTGFGKDYAVLG
WLKSFFKQFYGSVTKLDYVTLRLGFIMTHCRGNPKFNFHKYMIRALEDDFKQVVGISWYLWIFVVIFMLLNVHGWHTYFW

GOMAP step that crashed (if applicable): domain

Attach the output files: cmb-domain.log

System Details: condo

Additional context: Failing legumes (so far) are at /work/dillpicl/dpsaroud/GOMAP-legumes/data/common_bean and /work/dillpicl/dpsaroud/GOMAP-legumes/data/medicago1. For /work/dillpicl/dpsaroud/GOMAP-legumes/data/soybean, the domain step ran through without any errors.

wkpalan commented 5 years ago

@Thyra: I have worked on a dev version (v1.1-dev). The container is built at https://www.singularity-hub.org/collections/1176. Please check if this works and let me know. It seems to work on my test machine, but I didn't get to check MPI issues yet. I am traveling until the end of the week.

Thyra commented 5 years ago

@wkpalan Thanks, I'll give it a try! Condo is pretty busy atm so it'll take a while until I get my jobs to run but I'll let you know as soon as I have news.

Thyra commented 5 years ago

@wkpalan I tried it with the v1.1-dev branch of the repo and the GOMAP-singularity:v1.1.condo container, and that failed at setup: setup.log. Now I'm running the master branch of the repo with the GOMAP-singularity:v1.1.condo container, and that seems to run. I'll keep you posted.

Thyra commented 5 years ago

I ran a fresh install of the master branch with the GOMAP-singularity:v1.1.condo image and it failed with

Traceback (most recent call last):
  File "./gomap.py", line 57, in <module>
    config = init_dirs(config)
  File "/opt/GOMAP/code/utils/basic_utils.py", line 40, in init_dirs
    os.makedirs(gomap_dir, mode=0777)
  File "/usr/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/workdir/data/medicago1/GOMAP-medicago1'

You can see the installation at /work/dillpicl/dpsaroud/GOMAP-devMaster/. Input and slurm scripts and logs are at data/<plant>. Interestingly common_bean failed immediately while medicago1 stayed in zombie mode until timeout even though they both seem to have encountered the same problem.
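
For reference, on Python 2.7 `os.makedirs` raises `OSError: [Errno 17]` whenever the target directory already exists, which matches the traceback above. Below is a minimal sketch of a tolerant variant. `ensure_dir` is a hypothetical helper name, not a function from the GOMAP codebase, and the idea that a directory left over from an earlier run triggered the error here is an assumption.

```python
import errno
import os
import tempfile


def ensure_dir(path, mode=0o777):
    """Create path (and any missing parents); succeed quietly if it is already a directory."""
    try:
        os.makedirs(path, mode)
    except OSError as exc:
        # Swallow only the "already exists as a directory" case.
        # Anything else (permissions, a regular file in the way) is re-raised.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


# Demo in a throwaway location; "GOMAP-example" is a made-up directory name.
base = tempfile.mkdtemp()
target = os.path.join(base, "GOMAP-example")
ensure_dir(target)
ensure_dir(target)  # second call finds the directory and does not raise
```

On Python 3 only, `os.makedirs(path, exist_ok=True)` covers the same case without the explicit `except` block.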

wkpalan commented 5 years ago

> @wkpalan I tried it with the v1.1-dev branch of the repo and the GOMAP-singularity:v1.1.condo container and that failed at setup: setup.log Now I'm doing the master branch of the repo with the GOMAP-singularity:v1.1.condo container and that seems to run. I'll keep you posted.

I guess this is the first time anyone is testing the dev version of the pipeline. To test the dev pipeline you would need two repositories:

1) GOMAP-singularity: git clone -b v1.1-dev git@github.com:Dill-PICL/GOMAP-singularity.git
2) GOMAP codebase: cd GOMAP-singularity && git clone -b v1.1-dev git@github.com:Dill-PICL/GOMAP.git

Once these two are done, the dev environment is set up. I forgot to provide proper instructions.

wkpalan commented 5 years ago

> I ran a fresh install of the master branch with the GOMAP-singularity:v1.1.condo image and it failed with
>
> Traceback (most recent call last):
>   File "./gomap.py", line 57, in <module>
>     config = init_dirs(config)
>   File "/opt/GOMAP/code/utils/basic_utils.py", line 40, in init_dirs
>     os.makedirs(gomap_dir, mode=0777)
>   File "/usr/lib/python2.7/os.py", line 157, in makedirs
>     mkdir(name, mode)
> OSError: [Errno 17] File exists: '/workdir/data/medicago1/GOMAP-medicago1'
>
> You can see the installation at /work/dillpicl/dpsaroud/GOMAP-devMaster/. Input and slurm scripts and logs are at data/<plant>. Interestingly common_bean failed immediately while medicago1 stayed in zombie mode until timeout even though they both seem to have encountered the same problem.

Can you send the link to the medicago file? The common bean seems to work fine now on condo.

Thyra commented 5 years ago

Thanks, I'll give it a try! The medicago files are at /work/dillpicl/dpsaroud/GOMAP-legumes/data/medicago1 (the input file is called medicago1-input.fa )

Thyra commented 5 years ago

I made a mistake and started it again from the code master branch and GOMAP-singularity:v1.1.condo image, but now common bean seems to run fine. Medicago threw this error: "The input sequences contain non IUPAC amino acid characters." Here are the input file, a short script I wrote to find the sequences that have non-IUPAC characters, and the script's output:

medicago-non-IUPAC-symbs.tar.gz

There are indeed some sequences that biopython thinks contain non-IUPAC characters, for example:

>medtr.A17_HM341.Medtr0006s0140 medtr.A17_HM341.Medtr0006s0140.1 medtr.A17_HM341.Medtr0006s0140.1 LRR and NB-ARC domain disease resistance protein
MEILISVVAKIAEYTVVPFGRQASYLIFYKGNFKTLKDNVEDLEATRERMNHLVEGETQN
GKVIEKDVLNWLEKVNEVIEKANGLQNDPRNANVSCSAWPFPNLILRHQLSRKATKILKD
VVQVQGKGIFDQVGYLPPLDVVASSSTRDREKYDTRESLKEDIVKALADSTSCNIGVYGL
GGVGKTTLVEKVAQIAKEHKLFDRVVETEVSKNQDIKRIQGEIADSLGLRLEEETNRGRA
ERLRQRIKMEKSILIILDNIWTILVLKEVGIPVGDEHNGCKLLMTSRDQEVLLQMDVPKE
FTFKVELMSENETWSLFQFMAGDVVKDSNLKDLPFQVARKCEGLPLRVVXHHHPSRFLML
NFRLLTNQIGIHKL

The problem is the X towards the end of the second-to-last line. It is listed in the IUPAC codes as "any amino acid", but Biopython doesn't include it in its default protein alphabet (IUPAC.protein):

>>> from Bio.Alphabet import IUPAC
>>> IUPAC.protein.letters
'ACDEFGHIKLMNPQRSTVWY'
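
A check of this kind can be sketched in plain Python, without relying on `Bio.Alphabet`. This is not the script from the attached tarball and not GOMAP's own validation code. The names `parse_fasta`, `find_invalid`, `STANDARD` and `EXTENDED` are made up for illustration, and the extra letters in `EXTENDED` (B, X, Z, J, U, O) are taken from Biopython's extended protein alphabet.

```python
# The 20 unambiguous letters reported by IUPAC.protein.letters above
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
# Extended set: adds ambiguity and rare-residue codes, including X ("any amino acid")
EXTENDED = STANDARD | set("BXZJUO")


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_invalid(lines, alphabet=STANDARD):
    """Map each sequence ID to the sorted list of characters outside `alphabet`."""
    report = {}
    for header, seq in parse_fasta(lines):
        bad = sorted(set(seq.upper()) - alphabet)
        if bad:
            # Use the first whitespace-delimited token of the header as the ID
            seq_id = (header.split() or [""])[0]
            report[seq_id] = bad
    return report
```

Calling `find_invalid(open("medicago1-input.fa"))` would list every record with a character outside the 20-letter set. Passing `alphabet=EXTENDED` accepts X as well, so the Medtr0006s0140 record above would no longer be flagged for it.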

So I guess this comes down to another question of: How do we want GOMAP to behave when it encounters that X? I don't really know enough about the tools and what difference that makes to them, what do you think?

I've resubmitted another condo job to use the v1.1-dev branch and the code from v1.1-dev git@github.com:Dill-PICL/GOMAP.git but I don't think that should make a difference.

Thyra commented 5 years ago

Yes, the same message comes up with the v1.1-dev branch.

wkpalan commented 5 years ago

> Yes, the same message comes up with the v1.1-dev branch.

Do you mean it fails with non IUPAC characters?

wkpalan commented 5 years ago

I have updated the v1.1-dev branch to use "Bio.Alphabet.IUPAC.ExtendedIUPACProtein" to allow the use of X. I am not sure how the different tools will deal with this, but a simple missing amino acid should not create any issues with the methods we have been using so far.

Thyra commented 5 years ago

I'll give it a try, thank you! It seems that the other tools at least are running through even though I ofc don't know whether that X had an influence on the results (e.g. does blast understand it could be any amino acid or does it specifically look for an X). But I agree, allowing it is probably the best we can do, taking it out and changing the sequence would certainly be a bad idea.