flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0

Step 7 (reconciliation) ending with no output? #39

Closed MartinezRuiz-Carlos closed 4 years ago

MartinezRuiz-Carlos commented 4 years ago

Hello, I am running the pipeline on 9 archaea genomes (4 focal and 5 from NCBI). I have run up to step 7. It does not fail, in fact it claims it has completed, but it runs pretty much instantly and it looks like it generates empty outputs. This is the stdout I get:

1   2020-07-14  using ALE software (version v1.0) compiled from source; code origin: https://github.com/ssolo/ALE; code version 265fc4de061f47a4f38c51dc9cfc7a3dda05654e    ale_fullgenetree_dated_1
parsing ALE scenarios
Successfully parsed ALE scenarios

# Parsed reconciliation collection details:
1   2020-07-14  ale_fullgenetree_dated_1_parsed_1
Resume mode: first clean the database from previous inserts and indexes
currently set variable:
database='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database' dbfile='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3' parsedreccolid='1'
Successfully cleaned the database from previous inserts and indexes
Storing reconciliation parameters and load parsed reconciliation data into database
currently set variable:
database='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database' dbfile='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3' parsedrecs='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1' ALEversion='' ALEalgo='ALEml' ALEsourcenote='using ALE software (version v1.0) compiled from source; code origin: https://github.com/ssolo/ALE; code version 265fc4de061f47a4f38c51dc9cfc7a3dda05654e' parsedreccol='ale_fullgenetree_dated_1_parsed_1' parsedreccolid='1' parsedreccoldate='2020-07-14'
Successfully Stored reconciliation parameters and load parsed reconciliation data into database
0 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/summary_gene_tree_events_minfreq0.1
0 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/summary_gene_tree_events_minfreq0.25
0 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/summary_gene_tree_events_minfreq0.5
Successfully reconciled gene trees with ALE
Pantagruel pipeline task 7: complete.

All three of the files in ale_fullgenetree_dated_1_parsed_1 are empty. If I then try to run step 8, it fails with the error:

ortholog_collection_1
building matrix of gene presence / absence for 9 genomes
examining a total of 0 CDSs with non-ORFan family assignment
retrieveing orthology classification from collection: ortholog_col_id=1
Traceback (most recent call last):
  File "/pantagruel/scripts/get_ortholog_presenceabsence_matrix_from_sqlitedb.py", line 131, in <module>
    singleton = writefamrowout(currphyloprof, curfamog, lgenomecodes, dfoutmatortho, lcurrt)
  File "/pantagruel/scripts/get_ortholog_presenceabsence_matrix_from_sqlitedb.py", line 7, in writefamrowout
    if curfam[1] is not None: scurfam = '%s-%d'%curfam
TypeError: 'NoneType' object has no attribute '__getitem__'

Which I guess stems from an empty input. Any ideas where this is stemming from? Step 5 seems to have produced species trees. Sorry if this turns out to be something obvious, at first it looked to me like it, but I cannot put my finger on it. Thanks!
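The traceback is consistent with a database lookup that returned nothing: indexing None raises this TypeError (Python 3 words it as "object is not subscriptable"). As a minimal sketch only, assuming the value is either None or a (family, ortholog group id) pair, a defensive formatter could fail with a clearer message. format_family_og is a hypothetical helper written for illustration; it is not the function in the Pantagruel script.

```python
def format_family_og(curfam):
    """Format a (family, og_id) pair as 'family-og_id'.

    Hypothetical helper for illustration. Assumes curfam is None (no row
    was fetched) or a 2-tuple whose second element may be None when the
    family has no ortholog group.
    """
    if curfam is None:
        # Empty query result: report it explicitly instead of letting
        # the indexing of None raise an opaque TypeError.
        raise ValueError("no family/orthogroup record: upstream input is empty")
    family, og_id = curfam
    if og_id is None:
        return family
    return "%s-%d" % (family, og_id)
```

With a guard of this kind, an empty upstream result would surface as "upstream input is empty" rather than as a TypeError deep in the matrix-writing code.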

flass commented 4 years ago

Hi Carlos,

No worries, it is likely to be something NOT obvious! It does look like something has gone wrong with the ALE reconciliation run. You would get more information about that in the specific logs that should be located in the folder /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/ALE/.

Could you please provide a bit more context with those logs and possibly the full pipeline-level log for task 07?

I suspect it has something to do with the ALE executables not being found by the wrapper script (even though ALE seems to be detected by the pipeline, as it's showing the ALE version tag).

Best, Florent

MartinezRuiz-Carlos commented 4 years ago

Hi Florent, /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/ALE/ only contains an empty directory: ale_fullgenetree_dated_1. In the logs directory there's also the file get_orthologues_from_ALE_recs_ortholog_collection_1.log and the directory replspebypop, created at the same time as the ALE directory. Both are also empty. As for the whole pipeline logs for step 07, where would I find those? The other files and directories in the logs directory were created in previous steps. Sorry, I realise this does not provide an awful lot of information. Thanks!

Best, Carlos

flass commented 4 years ago

OK, so ALE seems to just not have run. By the pipeline logs for step 07, I meant the standard output/error of the pipeline, of which you provided above only the end of the stream; can you please give the whole o/e stream from the beginning of the task?

As for the issue itself, I think it is caused by an empty list of gene tree jobs for ALE to reconcile. Can you have a look at what's in ${alerec}/${collapsecond}_${replmethod}_Gtrees_list (in your environment I would think it is /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree/nocollapse_noreplace_Gtrees_list)? It corresponds to the list of files matching ${coltreechains}/${collapsecond}/${replmethod}/*-Gtrees.nwk (i.e. /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree/nocollapse/noreplace/*-Gtrees.nwk). Can you please check what you've got in there? If there is nothing, maybe go back to the logs of pipeline task 06 and see why it got lazy there.
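The check described above can be sketched in a few lines of Python. This is an illustration under assumptions, not part of Pantagruel: count_gene_tree_inputs is a hypothetical helper, and the only detail taken from the thread is the *-Gtrees.nwk file pattern.

```python
import glob
import os


def count_gene_tree_inputs(chain_dir, list_file):
    """Return (number of *-Gtrees.nwk files, number of non-blank list lines).

    Hypothetical helper: chain_dir is the folder expected to hold the
    converted tree chains, list_file the job list built from it.
    A missing list file counts as zero entries.
    """
    tree_files = glob.glob(os.path.join(chain_dir, "*-Gtrees.nwk"))
    listed = 0
    if os.path.isfile(list_file):
        with open(list_file) as handle:
            listed = sum(1 for line in handle if line.strip())
    return len(tree_files), listed
```

If both counts come back as zero for the task 07 input folder and list, there is nothing for ALE to reconcile and the problem lies upstream, in task 06.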

Best, Florent

MartinezRuiz-Carlos commented 4 years ago

Here's the full stdout from step 07:

This is Pantagruel pipeline version 13d8303229705f3c2c0092289b00e8c48bce4b07 using source code from repository '/pantagruel'

will try and resume computation of task where it was last stopped
# will run tasks: 7
[2020-07-17 21:34:22] Pantagruel pipeline task 7: compute species tree/gene tree reconciliations.
Will use the reconciliation method: ALE
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
no gene tree left to reconcile, skip reconciliation computation

# Reconciliation collection details:
1   2020-07-17  using ALE software (version v1.0) compiled from source; code origin: https://github.com/ssolo/ALE; code version 265fc4de061f47a4f38c51dc9cfc7a3dda05654e    ale_fullgenetree_dated_1
parsing ALE scenarios
Successfully parsed ALE scenarios

# Parsed reconciliation collection details:
1   2020-07-17  ale_fullgenetree_dated_1_parsed_1
Resume mode: first clean the database from previous inserts and indexes
currently set variable:
database='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database' dbfile='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3' parsedreccolid='1'
Successfully cleaned the database from previous inserts and indexes
Storing reconciliation parameters and load parsed reconciliation data into database
currently set variable:
database='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database' dbfile='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3' parsedrecs='/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1' ALEversion='' ALEalgo='ALEml' ALEsourcenote='using ALE software (version v1.0) compiled from source; code origin: https://github.com/ssolo/ALE; code version 265fc4de061f47a4f38c51dc9cfc7a3dda05654e' parsedreccol='ale_fullgenetree_dated_1_parsed_1' parsedreccolid='1' parsedreccoldate='2020-07-17'
Successfully Stored reconciliation parameters and load parsed reconciliation data into database
0 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/summary_gene_tree_events_minfreq0.1
0 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/summary_gene_tree_events_minfreq0.25
0 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/summary_gene_tree_events_minfreq0.5
Successfully reconciled gene trees with ALE
Pantagruel pipeline task 7: complete.

The directory you mentioned exists but is empty. I went back and looked at the outputs from step 06, all directories have outputs in them except raxml_trees and fullgenetree_tree_chains/nocollapse, which are empty. Step 06 definitely did something, it ran for about 10 days! It did not produce any errors or warnings. I hope that helps!

Best, Carlos

flass commented 4 years ago

Hi Carlos, so indeed there was an issue at task 06: the folder fullgenetree_tree_chains/nocollapse/ that you say is empty should be full of files! It should at least contain the MrBayes tree chains converted to Newick format (with tags *-Gtrees.nwk). With it being empty, there is nothing to reconcile during task 07... So it might be worth looking at what has been done during task 06, especially its step 4, by checking the general pipeline stdout/stderr log, but also the gene family specific logs in logs/replspebypop/. I know that in the past there have been issues where an error returned by that step went unnoticed by the pipeline safety checks, so maybe it failed there.
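One generic way to keep a silent failure like this from passing as "complete" is to check both the exit status of a step and the presence of its expected output files. The sketch below shows that idea only; run_step_checked is a hypothetical function, and nothing here reflects how Pantagruel's own safety checks are written.

```python
import glob
import os
import subprocess


def run_step_checked(cmd, out_dir, pattern):
    """Run cmd, then require a zero exit status and at least one output file.

    Hypothetical wrapper: cmd is an argument list for subprocess.run,
    out_dir and pattern describe where the step is expected to write.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("step exited with status %d" % result.returncode)
    outputs = glob.glob(os.path.join(out_dir, pattern))
    if not outputs:
        # A zero exit status is not enough: an empty output folder
        # would otherwise only be noticed by the next task.
        raise RuntimeError(
            "step reported success but wrote no %s in %s" % (pattern, out_dir)
        )
    return outputs
```

Applied to the conversion step, the pattern would be "*-Gtrees.nwk", so an empty folder would stop the task instead of letting task 07 run on nothing.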

MartinezRuiz-Carlos commented 4 years ago

Perhaps unsurprisingly, logs/replspebypop is also empty. I re-ran step 06 with the recovery option and this is the stdout I get:


will try and resume computation of task where it was last stopped
 will run tasks: 6
[2020-07-23 17:59:11] Pantagruel pipeline task 6: compute gene trees.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
succesfully generated gene family list : /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/cdsfams_minsize4
Resume task 6: skip converting alignments
succesfully converted alignemts from Fasta to Nexus format
create folders for MrBayes ouput, broken down by gene family prefixes, in '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse/'
Resume task 6, step 3: 1493 bayesian tree chains already complete
Resume task 6, step 3: 2 bayesian tree chains remain to compute
step 3: Will now run MrBayes in parallel (i.e. sequentially for each gene alignment, with several alignments processed in parallel
with options: Nruns=2 Ngen=2000000 Nchains=4 Samplefreq=500 

PANTAGFAMC001362.codes

current directory (output directory) is aff825c282a4:/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse/PANTAGFAMC0013
rsync -avz /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_cdsfam_alignments_species_code/nocollapse/PANTAGFAMC001362.codes.nex ./
sending incremental file list

sent 67 bytes  received 12 bytes  158.00 bytes/sec
total size is 7,376  speedup is 93.37
copied input files with exit status 0
ls ./*PANTAGFAMC001362.codes*
./PANTAGFAMC001362.codes.mb.ckp
./PANTAGFAMC001362.codes.mb.ckp~
./PANTAGFAMC001362.codes.mb.con.tre
./PANTAGFAMC001362.codes.mb.log
./PANTAGFAMC001362.codes.mb.lstat
./PANTAGFAMC001362.codes.mb.mcmc
./PANTAGFAMC001362.codes.mb.parts
./PANTAGFAMC001362.codes.mb.pstat
./PANTAGFAMC001362.codes.mb.run1.p
./PANTAGFAMC001362.codes.mb.run1.t
./PANTAGFAMC001362.codes.mb.run2.p
./PANTAGFAMC001362.codes.mb.run2.t
./PANTAGFAMC001362.codes.mb.trprobs
./PANTAGFAMC001362.codes.mb.tstat
./PANTAGFAMC001362.codes.mb.vstat
./PANTAGFAMC001362.codes.mbparam.txt
./PANTAGFAMC001362.codes.nex

mbmcmcopt=Nruns=2 Ngen=2000000 Nchains=4 Samplefreq=500 
data matrix:
    dimensions ntax=5 nchar=1197;

set autoclose=yes nowarn=yes
execute PANTAGFAMC001362.codes.nex
lset nst=6 rates=invgamma
showmodel
mcmcp Nruns=2 Ngen=2000000 Nchains=4 Samplefreq=500  file=PANTAGFAMC001362.codes.mb
mcmc
sumt minpartfreq=0.001 contype=allcompat
sump
quit

running MrBayes:
mb < PANTAGFAMC001362.codes.mbparam.txt
output of MrBayes phylogenetic reconstruction is :
ls ./*PANTAGFAMC001362.codes*
./PANTAGFAMC001362.codes.mb.ckp
./PANTAGFAMC001362.codes.mb.ckp~
./PANTAGFAMC001362.codes.mb.con.tre
./PANTAGFAMC001362.codes.mb.log
./PANTAGFAMC001362.codes.mb.lstat
./PANTAGFAMC001362.codes.mb.mcmc
./PANTAGFAMC001362.codes.mb.parts
./PANTAGFAMC001362.codes.mb.pstat
./PANTAGFAMC001362.codes.mb.run1.p
./PANTAGFAMC001362.codes.mb.run1.t
./PANTAGFAMC001362.codes.mb.run2.p
./PANTAGFAMC001362.codes.mb.run2.t
./PANTAGFAMC001362.codes.mb.trprobs
./PANTAGFAMC001362.codes.mb.tstat
./PANTAGFAMC001362.codes.mb.vstat
./PANTAGFAMC001362.codes.mbparam.txt
./PANTAGFAMC001362.codes.nex

PANTAGFAMC000266.codes

current directory (output directory) is aff825c282a4:/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse/PANTAGFAMC0002
rsync -avz /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_cdsfam_alignments_species_code/nocollapse/PANTAGFAMC000266.codes.nex ./
sending incremental file list

sent 66 bytes  received 12 bytes  156.00 bytes/sec
total size is 11,970  speedup is 153.46
copied input files with exit status 0
ls ./*PANTAGFAMC000266.codes*
./PANTAGFAMC000266.codes.mb.ckp
./PANTAGFAMC000266.codes.mb.ckp~
./PANTAGFAMC000266.codes.mb.con.tre
./PANTAGFAMC000266.codes.mb.log
./PANTAGFAMC000266.codes.mb.lstat
./PANTAGFAMC000266.codes.mb.mcmc
./PANTAGFAMC000266.codes.mb.parts
./PANTAGFAMC000266.codes.mb.pstat
./PANTAGFAMC000266.codes.mb.run1.p
./PANTAGFAMC000266.codes.mb.run1.t
./PANTAGFAMC000266.codes.mb.run2.p
./PANTAGFAMC000266.codes.mb.run2.t
./PANTAGFAMC000266.codes.mb.trprobs
./PANTAGFAMC000266.codes.mb.tstat
./PANTAGFAMC000266.codes.mb.vstat
./PANTAGFAMC000266.codes.mbparam.txt
./PANTAGFAMC000266.codes.nex

mbmcmcopt=Nruns=2 Ngen=2000000 Nchains=4 Samplefreq=500 
data matrix:
    dimensions ntax=12 nchar=975;

set autoclose=yes nowarn=yes
execute PANTAGFAMC000266.codes.nex
lset nst=6 rates=invgamma
showmodel
mcmcp Nruns=2 Ngen=2000000 Nchains=4 Samplefreq=500  file=PANTAGFAMC000266.codes.mb
mcmc
sumt minpartfreq=0.001 contype=allcompat
sump
quit

running MrBayes:
mb < PANTAGFAMC000266.codes.mbparam.txt
output of MrBayes phylogenetic reconstruction is :
ls ./*PANTAGFAMC000266.codes*
./PANTAGFAMC000266.codes.mb.ckp
./PANTAGFAMC000266.codes.mb.ckp~
./PANTAGFAMC000266.codes.mb.con.tre
./PANTAGFAMC000266.codes.mb.log
./PANTAGFAMC000266.codes.mb.lstat
./PANTAGFAMC000266.codes.mb.mcmc
./PANTAGFAMC000266.codes.mb.parts
./PANTAGFAMC000266.codes.mb.pstat
./PANTAGFAMC000266.codes.mb.run1.p
./PANTAGFAMC000266.codes.mb.run1.t
./PANTAGFAMC000266.codes.mb.run2.p
./PANTAGFAMC000266.codes.mb.run2.t
./PANTAGFAMC000266.codes.mb.trprobs
./PANTAGFAMC000266.codes.mb.tstat
./PANTAGFAMC000266.codes.mb.vstat
./PANTAGFAMC000266.codes.mbparam.txt
./PANTAGFAMC000266.codes.nex

step 3: MrBayes tree estimation complete
Resume task 6, step 4: 1495 bayesian tree chains remain to be processed for format conversion and replacement of collapsed clades
step 4: conversion of gene tree chains complete
Pantagruel pipeline task 6: complete.

flass commented 4 years ago

And so after re-running task 06, you still have no output in 06.gene_trees/fullgenetree_tree_chains/nocollapse/noreplace/ or any logs in logs/replspebypop/?

I'm concerned something is going wrong at this point but somehow the error log is not reported.

Maybe you could try running the script for the final step of task 06 (step 4: conversion of gene tree chains) on its own to see what it does? Here are the commands I suggest you run:

source /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/environ_pantagruel_db_sc3.sh
repltasklist=${bayesgenetrees}_${collapsecond}_nexus_list_resume
ptgthreads=1
repllogs=${ptgdb}/logs/replspebypop/replace_species_by_pop_in_gene_trees
replrun=test
echo "coltreechains=${coltreechains}; collapsecond=${collapsecond}; repltasklist=${repltasklist}"
python2.7 ${ptgscripts}/replace_species_by_pop_in_gene_trees.py -G ${repltasklist} --no_replace -o ${coltreechains}/${collapsecond} --threads=${ptgthreads} --reuse=0 --verbose=2 --logfile=${repllogs}_${replrun}.log

If really nothing comes out of it in terms of stdout/stderr, maybe you can omit the --logfile option, which redirects these streams; maybe something is happening at that level.

Also, it's important that you start with ptgthreads=1, as stdout/stderr is more easily reported in sequential mode. Then maybe you can try with a higher number of CPUs to replicate any potential bug linked to the multithreading; anyway, this conversion step would take too long otherwise.

I hope we'll get through this weird bug soon!

MartinezRuiz-Carlos commented 4 years ago

Hello Florent,

Sorry for the delay in my response. I ran the commands you mentioned; they failed because ${repltasklist} expanded to /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees__nexus_list_resume, a file that does not exist. From looking at the script, it is not clear to me why ${collapsecond} did not produce any output. I re-ran the last of the commands you posted (the python2.7 one), replacing ${repltasklist} with /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees_nocollapse_nexus_list_resume. This seems to have worked: I now have 1495 -Gtrees.nwk files in panta_out/db_sc3/06.gene_trees/fullgenetree_tree_chains/noreplace/. panta_out/db_sc3/06.gene_trees/fullgenetree_tree_chains/nocollapse/ is still empty though. The first few lines of the log file look like this:

/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse/PANTAGFAMC0005/PANTAGFAMC000580.codes
/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse/PANTAGFAMC0005/PANTAGFAMC000580.cod
method: noreplace
mapPop2GeneTree()
parseChain('['/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse/PANTAGFAMC0005/PANTAGFA
^M50^M100^M150^M200^M250^M300^M350^M400^M450^M500^M550^M600^M650^M700^M750^M800^M850^M900^M950^M1000^M1050^M1100^M1150^M1200^M1250^M1300^M1350
/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_tree_chains/noreplace/PANTAGFAMC000580-Gtrees.nwk ...done (800

And it goes on like that for a while, so again, I think this step worked now. If I try to run step 07 I still get the same as before though, so this run has still not solved the initial issue.

It seems we are getting somewhere though, thanks again!

flass commented 4 years ago

Hi Carlos,

OK, so that makes much more sense now. The problem is indeed that the ${collapsecond} variable is not defined. Because all the scripts rely on it being there, it's normal that it fails.
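The double underscore in the path Carlos reported (fullgenetree_mrbayes_trees__nexus_list_resume) is what an empty expansion of ${collapsecond} produces. As a small illustration in Python, assuming only the template from the suggested commands, the sketch below contrasts a lenient expansion (empty string for unset names, as the shell does by default) with a strict one. Both functions are hypothetical; in the shell itself, ${collapsecond:?} is the standard way to abort on an unset or empty variable.

```python
import re

# Matches ${name} placeholders only; other shell syntax is left untouched.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def expand_lenient(template, variables):
    """Mimic default shell expansion: unset names become empty strings."""
    return PLACEHOLDER.sub(lambda m: variables.get(m.group(1), ""), template)


def expand_strict(template, variables):
    """Raise instead of silently expanding an unset or empty name."""
    def lookup(match):
        name = match.group(1)
        value = variables.get(name, "")
        if not value:
            raise KeyError("variable %r is unset or empty" % name)
        return value
    return PLACEHOLDER.sub(lookup, template)
```

For example, expanding "${bayesgenetrees}_${collapsecond}_nexus_list_resume" leniently with only bayesgenetrees defined yields a name containing "__nexus_list_resume", whereas the strict version stops immediately and names the missing variable.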

So of course I'll track the bug so that future runs work well, but you will also need to do some (simple) hacking to fix the database folder structure, otherwise the pipeline scripts (in particular task 07) won't work.

First you need to move the *Gtrees.nwk files to where they are expected to be:

mkdir -p panta_out/db_sc3/06.gene_trees/fullgenetree_tree_chains/nocollapse/noreplace/
mv panta_out/db_sc3/06.gene_trees/fullgenetree_tree_chains/noreplace/* panta_out/db_sc3/06.gene_trees/fullgenetree_tree_chains/nocollapse/noreplace/

At this point you could try to run task 07 again to see if that's enough to fix it.
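For anyone who would rather script the relocation, here is a minimal Python sketch of the same idea. Carlos reported the converted files under fullgenetree_tree_chains/noreplace/, and the pipeline expects them under fullgenetree_tree_chains/nocollapse/noreplace/; relocate_gene_trees is a hypothetical helper that takes both folders as arguments and moves only files matching *-Gtrees.nwk.

```python
import glob
import os
import shutil


def relocate_gene_trees(src_dir, dest_dir, pattern="*-Gtrees.nwk"):
    """Move files matching pattern from src_dir into dest_dir.

    Hypothetical helper: creates dest_dir if needed and returns the
    number of files moved. Non-matching files are left in place.
    """
    os.makedirs(dest_dir, exist_ok=True)
    moved = 0
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
        moved += 1
    return moved
```

The return value gives a quick sanity check: in this thread it should match the 1495 tree chain files Carlos counted after the manual conversion.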

Also, I'm a bit puzzled as to why you get an empty ${collapsecond} variable when you run source /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/environ_pantagruel_db_sc3.sh; do you get any errors when you source this file?

MartinezRuiz-Carlos commented 4 years ago

Hello Florent,

Yes, that finally solved the issue; step 07 is running now, thanks! I normally run pantagruel from the Docker container, but source cannot be run through Docker, so I simply mimicked the container's directory structure on my local computer by linking pantagruel to root. I got the following errors:

/pantagruel/scripts/pipeline/environ_pantagruel_secondaryvars.sh:51: bad substitution
/pantagruel/scripts/pipeline/pantagruel_pipeline_functions.sh:export:7: invalid option(s)
/pantagruel/scripts/pipeline/pantagruel_pipeline_functions.sh:export:43: invalid option(s)
/pantagruel/scripts/pipeline/pantagruel_pipeline_functions.sh:export:56: invalid option(s)

But I still got most of the variables (except for ${collapsecond}). Sorry this is a bit convoluted, I hope it helps!

flass commented 4 years ago

All right. The reason you get these errors is that variables like ${collapsecond} are secondary variables: they are not defined directly in the environ_pantagruel_db_sc3.sh file but, as you can see in the errors, by sourcing scripts that are part of the pantagruel/ repository. However, the path to the repo points to a location internal to the Docker container, i.e. /pantagruel/. So if you want to source the environ_pantagruel_db_sc3.sh file properly outside the Docker container, you need a copy of the repository somewhere, and the environ file must refer to it. You can do that by linking the repo to the equivalent location under / on your computer (but this requires admin rights):

sudo ln -s /path/to/your/repo/pantagruel /

Or you can use pantagruel --refresh init, but in that case be aware that you're modifying your environ file and you'll have to do the reverse refresh with the Docker container's pantagruel. One way to avoid that is to make a copy of the environ file:

cp -p environ_pantagruel_db_sc3.sh environ_pantagruel_db_sc3.sh.docker # keep that one for your future docker-based runs
/path/to/your/repo/pantagruel/pantagruel -i environ_pantagruel_db_sc3.sh --refresh init

I hope this helps!

Now, as to why your variable ${collapsecond} was not defined in the first place, which led to that cock-up, I still need to investigate, as everything seems in order... Maybe you happened to use code that was unstable at the time of running task 06?

MartinezRuiz-Carlos commented 4 years ago

Hello Florent, yes, that makes sense. I have been using version 13d8303229705f3c2c0092289b00e8c48bce4b07; I am not sure if that helps. I have a potentially related issue, this time with step 08; apologies if this is unrelated though. When running it I get:

will try and resume computation of task where it was last stopped
 will run tasks: 8
[2020-08-03 16:45:52] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
generating ortholog collection from reconciled gene trees
 call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees= --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
first delete previous records for this ortholog collection ('ortholog_collection_1') in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
step 2.0: completed importing ortholog collection record into database
step 2.1: completed importing ortholog classification into database for reconciled gene trees
step 2.2: completed importing ortholog classification into database for unreconciled gene trees

generating abs/pres matrix
ortholog_collection_1
building matrix of gene presence / absence for 9 genomes
examining a total of 12545 CDSs with non-ORFan family assignment
retrieveing orthology classification from collection: ortholog_col_id=1
1495 families not covererd by orthology classification (means no evolution scenario was inferred for these families)
0 families covererd by orthology classification into a total of 0 orthologous groups
these totalize 5 families with unique representative in the dataset (singletons) and 1490 others [total: 1495]
step 3: completed generating abs/pres matrix

listing clade-specific orthologs
ERROR: step 4: failed listing clade-specific orthologs; check specific logs in '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_clade_specific_genes.log' for more details
ERROR: Pantagruel pipeline task 8: failed.

The error in the log file is a bit weird, it looks like again something was missing?

[1] use ortholog classification of homologous genes
[1] "load 'genocount' table of ortholog cluster occurrence in genomes frmo file '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/mixed_majrule_combined_0.5.orthologs_gene_abspres.mat.RData'"
Error: unexpected symbol in:
"                                           "        an count matrix of homologous gene families generated by Pantagruel task 02."
                                spe"
Execution halted

I dug around the script for a bit but again could not find where that error may be coming from. Sorry to be a pain, but it seems like it is getting there!

flass commented 4 years ago

Hi Carlos, that's fine, no worries - bugs are my fault not yours, and I'm very happy you are using Pantagruel regardless of it still being rough around the edges!

This is an unrelated bug, which I think I already addressed in 96b6c77 (532610b for branch master); the latest code (or Docker image) should have this right.

However, I do expect you'll run into some stopping bug during tasks 08 or 09, simply because I have not yet finished the interface to parse the GeneRax outputs; I'm almost there but I still need to wrap it up and was a bit short of time to do that lately. I'm on holidays at the moment so won't touch it before a week or two; sorry if that delays your research. That said, there is a script that allows you to parse the GeneRax output outside of the pipeline integration, at least to have an idea of what events occurred per gene family over the Species tree and at what frequency: https://github.com/flass/pantagruel/blob/usingGeneRax/scripts/parse_generax_nhx.py (you'll need to set the PYTHONPATH variable beforehand: export PYTHONPATH=$ptgrepo/python_libs:$PYTHONPATH)

I hope it helps! Florent