ERROR: step 6: failed comparing each clade-specific core genome

Hello Florent,

The pipeline now gets further, but I'm afraid I have hit a new error. First, the topGO library was not installed in the Docker Hub build; I worked around this by manually installing gfortran and then topGO in the image. You may want to look into this. Unfortunately, that did not solve the issue, and I now get this error:

This is Pantagruel pipeline version 8b582a05afeb3f06ed346fa281d5eec81b77ab13 using source code from repository '/pantagruel'

will try and resume computation of task where it was last stopped
# will run tasks: 8
[2020-09-23 06:58:01] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
generating ortholog collection from reconciled gene trees
# call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees= --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
first delete previous records for this ortholog collection ('ortholog_collection_1') in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
step 2.0: completed importing ortholog collection record into database
step 2.1: completed importing ortholog classification into database for reconciled gene trees
step 2.2: completed importing ortholog classification into database for unreconciled gene trees

step 3: generating abs/pres matrix
ortholog_collection_1
building matrix of gene presence / absence for 9 genomes
examining a total of 12545 CDSs with non-ORFan family assignment
retrieveing orthology classification from collection: ortholog_col_id=1
1495 families not covererd by orthology classification (means no evolution scenario was inferred for these families)
0 families covererd by orthology classification into a total of 0 orthologous groups
these totalize 5 families with unique representative in the dataset (singletons) and 1490 others [total: 1495]
step 3: completed generating abs/pres matrix

listing clade-specific orthologs
step 4: completed listing clade-specific orthologs

null device 
          1 
Found 52422 functional annotation records linked to GO terms in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
Will now run GO term enrichment tests
step 5.1: generating core genome background term distribution for clades
step 5.0: generating core genome background term distribution for the whole dataset
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab_nonull
step 5.1: generating core genome background term distribution for each clade in the tree based on ortholog collection 1
clade0  (repr.: 'CUNDIV1'; size: 3) 'CUNDIV1','CUNDIV2','GPL37'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab_nonull
clade1  (repr.: 'CUNDIV1'; size: 6) 'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab_nonull
clade2  (repr.: 'CUNDIV1'; size: 2) 'CUNDIV1','CUNDIV2'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab_nonull
clade3  (repr.: 'CUNDIV1'; size: 5) 'FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab_nonull
clade4  (repr.: 'CUNDIV1'; size: 4) 'FAM36','FAR37','FERACI1','FERACI2'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab_nonull
clade5  (repr.: 'CUNDIV1'; size: 3) 'FAM36','FAR37','FERACI2'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab_nonull
clade6  (repr.: 'CUNDIV1'; size: 2) 'FAM36','FAR37'
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab_nonull
step 5.2: completed 

step 5.2: 
-rw-r--r-- 1 1000 1000 769K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 567K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab_nonul
clade0  'CUNDIV1','CUNDIV2','GPL37'
-rw-r--r-- 1 1000 1000 252K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 184K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab_nonull
clade1  'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 517K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 383K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab_nonull
clade2  'CUNDIV1','CUNDIV2'
-rw-r--r-- 1 1000 1000 173K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 127K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab_nonull
clade3  'FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 426K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 315K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab_nonull
clade4  'FAM36','FAR37','FERACI1','FERACI2'
-rw-r--r-- 1 1000 1000 351K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 259K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab_nonull
clade5  'FAM36','FAR37','FERACI2'
-rw-r--r-- 1 1000 1000 255K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 188K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab_nonull
clade6  'FAM36','FAR37'
-rw-r--r-- 1 1000 1000 169K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 123K Sep 23 07:00 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab_nonull
step 5.2: completed 

step 6: comparing each clade-specific core genome to its respective core genome
ERROR: step 6: failed comparing each clade-specific core genome to its respective core genome; check specific logs in '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/GOterm_enrichment/cladespecific_vs_coregenome_genes*' for more details
ERROR: Pantagruel pipeline task 8: failed.

I have tracked it back to this file being empty:

/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab_nonull

The log from ALE is also empty, so maybe that is where the issue lies?

panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log

Thanks again!

Hi Carlos,

Thank you for reporting this. I fixed the build of the R package topGO by adding the gfortran deb package to the Dockerfile recipe (fixed in [master] 55e275b and [usingGeneRax] 07602e5).

Regarding your other issue, it is failing on the step that requires topGO; this means that the package you installed is not accessible from within the Docker image, which is expected. Please update your Docker image; hopefully that will fix it.

Best,

Florent

Hello Florent,

Sorry to re-open this issue, but it seems to be independent of the previous topGO problem (sorry it took me a while to get back to you; I re-ran the whole pipeline, as the version I was using was quite old). As I mentioned in my previous comment, all the tab_nonull files are empty, leading to the R error. The tab files do have content, e.g.:

FACI_RS00005    GO:0003677
FACI_RS00005    GO:0004803
FACI_RS00005    GO:0006313
FACI_RS00005    
FACI_RS00010    GO:0004521
FACI_RS00010    GO:0043571
FACI_RS00010    
FACI_RS00015    GO:0003676
FACI_RS00015    GO:0004519
FACI_RS00015    GO:0043571

So my guess is that the issue, whatever it is, must be linked to this step in pantagruel_pipeline_08_clade_specific_genes.sh:

tail -n +2 ${cladedefs} | while read cla ${cladedefhead} ; do
  claspeset="'$(echo ${clade} | sed -e "s/,/','/g")'"
  echo "$cla $name $claspeset"
  cladest=${claderefgodir}/${cla}_pangenome_terms.tab
  qpancla="select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in (${claspeset})"
  sqlite3 -cmd ".mode tab" ${sqldb} "${qpancla};" > ${cladest}
  checkexec "step 5.2: failed ${step5} for clade ${cla} including NULL go_id"
  sqlite3 -cmd ".mode tab" ${sqldb} "${qpancla} and go_id not null;" > ${cladest}_nonull

Is the issue that my genomes don't have any non-null GO ids for some reason? Could this be related to my data specifically? Many thanks!

Best, Carlos

Hi Carlos, thanks for reporting this error. I don't think it has anything to do with your data lacking GO ids on its genes (that would be really unlikely), as you show that your table (I assume one of the clade*_pangenome_terms.tab files) has a non-empty second column, which is the go_id column. I can't really see why the paired *tab_nonull file, generated by the same command with just an added filter (go_id not null) in the where clause, would be empty. So can you confirm that the files you see as empty and those with content are of the same kind? Indeed, I can see in the first message you posted that the files clade*_coregenome_terms.tab* are consistently empty (*tab or *tab_nonull), while the files clade*_pangenome_terms.tab* (*tab or *tab_nonull) always have some content.
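One way to confirm that a given .tab file really carries GO ids is to count the rows whose second column is non-empty; the paired _nonull file would be expected to hold that many lines. The following is only a minimal sketch: the function name `count_nonnull_go_rows` is hypothetical (it is not part of Pantagruel), and it assumes the two-column, tab-separated layout shown in the sample above, with a NULL go_id written as an empty field.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pantagruel): count the rows of a
# tab-separated "locus_tag<TAB>go_id" file whose go_id field is non-empty.
# Assumption: a NULL go_id appears as an empty second field.
count_nonnull_go_rows() {
  # n + 0 forces a numeric 0 to be printed when no row matches
  # (for instance when the file is empty).
  awk -F '\t' '$2 != "" { n++ } END { print n + 0 }' "$1"
}
```

For example, the output of `count_nonnull_go_rows clade0_pangenome_terms.tab` can be compared with `wc -l < clade0_pangenome_terms.tab_nonull`.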

Can you please run the following ls command, just to verify the content of the files that were produced by the same run:

ls -lh 08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/

The output of this command should actually already be present in the logs from task 08.

I expect that the core genome background term files (clade*_coregenome_terms.tab*) will be specifically empty, meaning there is something wrong with the command that generates them (step 5.1 of task 08; I've been battling with it for a while).
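To list only the empty term files instead of scanning the whole listing, something like the following could be used. It is a sketch under assumptions: the function name `list_empty_term_files` is hypothetical (not part of Pantagruel), and the file name pattern is taken from the file names shown in the logs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pantagruel): print the path of every
# zero-byte *_terms.tab* file found under the given directory, sorted.
list_empty_term_files() {
  local dir="$1"
  # -size 0 matches files that occupy zero blocks, i.e. empty files.
  find "$dir" -type f -name '*_terms.tab*' -size 0 | sort
}
```

Calling it on the 08.orthologs/ortholog_collection_1/GeneOntology folder would cover both the whole-dataset files and those in clade_go_term_reference_sets.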

Thanks for your quick response, Florent. Here's the output from the ls command. As you expected, the clade*_coregenome_terms.tab* files are empty:

-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade0_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade0_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 250K oct 11 10:43 clade0_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 183K oct 11 10:43 clade0_pangenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade1_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade1_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 515K oct 11 10:43 clade1_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 380K oct 11 10:43 clade1_pangenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade2_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade2_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 172K oct 11 10:43 clade2_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 126K oct 11 10:43 clade2_pangenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade3_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade3_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 424K oct 11 10:44 clade3_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 313K oct 11 10:44 clade3_pangenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade4_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade4_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 349K oct 11 10:44 clade4_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 258K oct 11 10:44 clade4_pangenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade5_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade5_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 254K oct 11 10:44 clade5_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 187K oct 11 10:44 clade5_pangenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade6_coregenome_terms.tab
-rw-r--r-- 1 carlos carlos    0 oct 11 10:43 clade6_coregenome_terms.tab_nonull
-rw-r--r-- 1 carlos carlos 168K oct 11 10:44 clade6_pangenome_terms.tab
-rw-r--r-- 1 carlos carlos 122K oct 11 10:44 clade6_pangenome_terms.tab_nonull

OK, thanks. Actually, can you copy-paste the logs you got for task 08 from the latest run of the pipeline? There is something wrong showing in the logs above (the representative genome for core gene sets is invariant, even when it is not included in the focus clade), but I'm not sure this would still be the case in the latest version.
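The mismatch between representative and clade members can be spotted mechanically in a saved copy of the stdout. This is a rough sketch: the function name `find_foreign_representatives` is hypothetical (not part of Pantagruel), and it assumes the step 5.1 line format seen in the logs above, i.e. `cladeN  (repr.: 'CODE'; size: N) 'CODE','CODE',...`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pantagruel): read task 08 stdout on stdin
# and print the name of every clade whose representative genome code is not
# among the member codes listed on the same line.
find_foreign_representatives() {
  awk '
    /^clade[0-9]+[ \t]+\(repr\.: / {
      # \047 is the single-quote character; the representative code is the
      # text between the first pair of single quotes on the line.
      split($0, parts, "\047")
      repr = parts[2]
      # The member codes are listed after the closing parenthesis.
      members = substr($0, index($0, ")") + 1)
      if (index(members, "\047" repr "\047") == 0) print $1
    }
  '
}
```

It reads from standard input, so a saved log (here a hypothetical file task08.log) can be checked with `find_foreign_representatives < task08.log`.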

Here's the full stdout for the latest run:

This is Pantagruel pipeline version 55e275b525a5d4ca40b278ff9523a0f89447468c using source code from repository '/pantagruel'

will try and resume computation of task where it was last stopped
# will run tasks: 8
[2020-10-12 10:43:17] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
generating ortholog collection from reconciled gene trees
# call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees= --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
first delete previous records for this ortholog collection ('ortholog_collection_1') in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
step 2.0: completed importing ortholog collection record into database
step 2.1: completed importing ortholog classification into database for reconciled gene trees
step 2.2: completed importing ortholog classification into database for unreconciled gene trees

step 3: generating abs/pres matrix
ortholog_collection_1
building matrix of gene presence / absence for 9 genomes
examining a total of 12545 CDSs with non-ORFan family assignment
retrieveing orthology classification from collection: ortholog_col_id=1
1495 families not covererd by orthology classification (means no evolution scenario was inferred for these families)
0 families covererd by orthology classification into a total of 0 orthologous groups
these totalize 5 families with unique representative in the dataset (singletons) and 1490 others [total: 1495]
step 3: completed generating abs/pres matrix

listing clade-specific orthologs
step 4: completed listing clade-specific orthologs

null device 
          1 
Found 52432 functional annotation records linked to GO terms in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
Will now run GO term enrichment tests
step 5.1: generating core genome background term distribution for clades
step 5.0: generating core genome background term distribution for the whole dataset
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab_nonull
step 5.1: generating core genome background term distribution for each clade in the tree based on ortholog collection 1
clade0  (repr.: 'CUNDIV1'; size: 3) 'CUNDIV1','CUNDIV2','GPL37'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab_nonull
clade1  (repr.: 'CUNDIV1'; size: 6) 'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab_nonull
clade2  (repr.: 'CUNDIV1'; size: 2) 'CUNDIV1','CUNDIV2'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab_nonull
clade3  (repr.: 'CUNDIV1'; size: 5) 'FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab_nonull
clade4  (repr.: 'CUNDIV1'; size: 4) 'FAM36','FAR37','FERACI1','FERACI2'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab_nonull
clade5  (repr.: 'CUNDIV1'; size: 3) 'FAM36','FAR37','FERACI2'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab_nonull
clade6  (repr.: 'CUNDIV1'; size: 2) 'FAM36','FAR37'
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab_nonull
step 5.2: completed 

step 5.2: 
-rw-r--r-- 1 1000 1000 764K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 562K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab_nonul
clade0  'CUNDIV1','CUNDIV2','GPL37'
-rw-r--r-- 1 1000 1000 250K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 183K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab_nonull
clade1  'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 515K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 380K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab_nonull
clade2  'CUNDIV1','CUNDIV2'
-rw-r--r-- 1 1000 1000 172K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 126K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab_nonull
clade3  'FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 424K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 313K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab_nonull
clade4  'FAM36','FAR37','FERACI1','FERACI2'
-rw-r--r-- 1 1000 1000 349K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 258K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab_nonull
clade5  'FAM36','FAR37','FERACI2'
-rw-r--r-- 1 1000 1000 254K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 187K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab_nonull
clade6  'FAM36','FAR37'
-rw-r--r-- 1 1000 1000 168K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 122K Oct 12 10:45 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab_nonull
step 5.2: completed 

step 6: comparing each clade-specific core genome to its respective core genome
ERROR: step 6: failed comparing each clade-specific core genome to its respective core genome; check specific logs in '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/GOterm_enrichment/cladespecific_vs_coregenome_genes*' for more details
ERROR: Pantagruel pipeline task 8: failed.

Thank you for this. I attempted a fix in 0967741 and 5d8f32d, which should be available soon as Docker Hub builds; it's only an update to the pipeline code, so an incremental build of your own Docker image should be quick.

I introduced a -v / --verbose option to the pipeline to get more details on what's going on with these SQLite queries in task 08. Can you please re-run task 08 of the pipeline (you can do a clean task run with -F or just resume with -R) with the new code and the -v option? Hopefully it will help us solve the issue.

Hello Florent, here's the full stdout with the -v option enabled:

This is Pantagruel pipeline version 5d8f32d5dc4405a0767f9ce4949a325069f1cc28 using source code from repository '/pantagruel'

will try and resume computation of task where it was last stopped
# will run tasks: 8
[2020-10-12 21:19:22] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
generating ortholog collection from reconciled gene trees
# call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees= --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
first delete previous records for this ortholog collection ('ortholog_collection_1') in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
step 2.0: completed importing ortholog collection record into database
step 2.1: completed importing ortholog classification into database for reconciled gene trees
step 2.2: completed importing ortholog classification into database for unreconciled gene trees

step 3: generating abs/pres matrix
ortholog_collection_1
building matrix of gene presence / absence for 9 genomes
examining a total of 12545 CDSs with non-ORFan family assignment
retrieveing orthology classification from collection: ortholog_col_id=1
1495 families not covererd by orthology classification (means no evolution scenario was inferred for these families)
0 families covererd by orthology classification into a total of 0 orthologous groups
these totalize 5 families with unique representative in the dataset (singletons) and 1490 others [total: 1495]
step 3: completed generating abs/pres matrix

listing clade-specific orthologs
step 4: completed listing clade-specific orthologs

null device 
          1 
Found 52432 functional annotation records linked to GO terms in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
Will now run GO term enrichment tests
step 5.1: generating core genome background term distribution for clades
step 5.0: generating core genome background term distribution for the whole dataset (repr.: 'ACIDIP'; size: 9)
#
select distinct locus_tag, go_id 
  FROM ( 
   SELECT gene_family_id, og_id, min(cds_code) AS cds_code
    FROM coding_sequences AS cod
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
    INNER JOIN gene_fam_og_sizes USING (gene_family_id, og_id) 
   WHERE code='ACIDIP'
   AND size=9 AND genome_present=9
   GROUP BY gene_family_id, og_id 
  ) AS q
  INNER JOIN coding_sequences using (cds_code)
  LEFT JOIN functional_annotations using (nr_protein_id) 
  LEFT JOIN interpro2GO using (interpro_id)

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab_nonull
step 5.1: generating core genome background term distribution for each clade in the tree based on ortholog collection 1
clade0  (repr.: 'CUNDIV1'; size: 3) 'CUNDIV1','CUNDIV2','GPL37'
#
  CREATE TEMP TABLE panclade0 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('CUNDIV1','CUNDIV2','GPL37');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade0 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade0
    WHERE code='CUNDIV1'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=3 AND genome_present=3

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab_nonull
clade1  (repr.: 'ACIDIP'; size: 6) 'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
#
  CREATE TEMP TABLE panclade1 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade1 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade1
    WHERE code='ACIDIP'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=6 AND genome_present=6

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab_nonull
clade2  (repr.: 'CUNDIV1'; size: 2) 'CUNDIV1','CUNDIV2'
#
  CREATE TEMP TABLE panclade2 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('CUNDIV1','CUNDIV2');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade2 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade2
    WHERE code='CUNDIV1'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=2 AND genome_present=2

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab_nonull
clade3  (repr.: 'FAM36'; size: 5) 'FAM36','FAR37','FERACI1','FERACI2','FTT37'
#
  CREATE TEMP TABLE panclade3 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('FAM36','FAR37','FERACI1','FERACI2','FTT37');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade3 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade3
    WHERE code='FAM36'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=5 AND genome_present=5

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab_nonull
clade4  (repr.: 'FAM36'; size: 4) 'FAM36','FAR37','FERACI1','FERACI2'
#
  CREATE TEMP TABLE panclade4 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('FAM36','FAR37','FERACI1','FERACI2');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade4 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade4
    WHERE code='FAM36'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=4 AND genome_present=4

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab_nonull
clade5  (repr.: 'FAM36'; size: 3) 'FAM36','FAR37','FERACI2'
#
  CREATE TEMP TABLE panclade5 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('FAM36','FAR37','FERACI2');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade5 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade5
    WHERE code='FAM36'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=3 AND genome_present=3

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab_nonull
clade6  (repr.: 'FAM36'; size: 2) 'FAM36','FAR37'
#
  CREATE TEMP TABLE panclade6 AS 
   SELECT cds_code, locus_tag, nr_protein_id, gene_family_id, og_id, code 
    FROM coding_sequences 
    INNER JOIN replicons USING (genomic_accession) 
    INNER JOIN assemblies USING (assembly_id) 
    LEFT JOIN (
      SELECT replacement_label_or_cds_code AS cds_code, *
      FROM orthologous_groups
    ) USING (gene_family_id, cds_code)
   WHERE gene_family_id IS NOT NULL 
   AND ((ortholog_col_id=1) OR (ortholog_col_id IS NULL))
   AND code IN ('FAM36','FAR37');
  SELECT distinct locus_tag, go_id 
   FROM (
    SELECT gene_family_id, og_id, count(*) AS size, count(distinct code) AS genome_present
     FROM panclade6 
    GROUP BY gene_family_id, og_id
   ) AS q1
   INNER JOIN (
    SELECT * from panclade6
    WHERE code='FAM36'
   ) AS q2 USING (gene_family_id, og_id)
   LEFT JOIN functional_annotations USING (nr_protein_id) 
   LEFT JOIN interpro2GO USING (interpro_id) 
  WHERE size=2 AND genome_present=2

-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab_nonull
step 5.1: completed generating core genome background term distribution for each clade in the tree based on ortholog collection 1

step 5.2: generating pangenome background term distribution for clades
#
select distinct locus_tag, go_id 
 from coding_sequences 
 left join functional_annotations using (nr_protein_id) 
 left join interpro2GO using (interpro_id)

-rw-r--r-- 1 1000 1000 764K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 562K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab_nonul
clade0  'CUNDIV1','CUNDIV2','GPL37'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('CUNDIV1','CUNDIV2','GPL37')
-rw-r--r-- 1 1000 1000 250K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 183K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab_nonull
clade1  'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37')
-rw-r--r-- 1 1000 1000 515K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 380K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab_nonull
clade2  'CUNDIV1','CUNDIV2'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('CUNDIV1','CUNDIV2')
-rw-r--r-- 1 1000 1000 172K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 126K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab_nonull
clade3  'FAM36','FAR37','FERACI1','FERACI2','FTT37'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('FAM36','FAR37','FERACI1','FERACI2','FTT37')
-rw-r--r-- 1 1000 1000 424K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 313K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab_nonull
clade4  'FAM36','FAR37','FERACI1','FERACI2'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('FAM36','FAR37','FERACI1','FERACI2')
-rw-r--r-- 1 1000 1000 349K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 258K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab_nonull
clade5  'FAM36','FAR37','FERACI2'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('FAM36','FAR37','FERACI2')
-rw-r--r-- 1 1000 1000 254K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 187K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab_nonull
clade6  'FAM36','FAR37'
#select distinct locus_tag, go_id from coding_sequences 
  inner join replicons using (genomic_accession)
  inner join assemblies using (assembly_id)
  left join functional_annotations using (nr_protein_id) 
  left join interpro2GO using (interpro_id) 
  where code in ('FAM36','FAR37')
-rw-r--r-- 1 1000 1000 168K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 122K Oct 12 21:21 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab_nonull
step 5.2: completed 

step 6: comparing each clade-specific core genome to its respective core genome
ERROR: step 6: failed comparing each clade-specific core genome to its respective core genome; check specific logs in '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/GOterm_enrichment/cladespecific_vs_coregenome_genes*' for more details
ERROR: Pantagruel pipeline task 8: failed.

I hope that helps!

Hi Carlos, thanks for the above; we're closing in on the problem. The SQL queries seem to be what they should be, yet they do not yield the expected output, so there may be something wrong with the underlying data. I notice you have the following message at step 3:

1495 families not covererd by orthology classification (means no evolution scenario was inferred for these families)
0 families covererd by orthology classification into a total of 0 orthologous groups

That means the distribution of genes in genomes is being considered at the level of homologous gene families, not of the orthologous gene sub-families derived from the gene tree reconciliations. There must be an issue that prevents the information from the reconciliations (task 07) from being propagated. On top of being a shame (using reconciliation information is the whole point of Pantagruel), this could lead to downstream issues and prevent task 08 from running smoothly.
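The "0 families covered" message can be cross-checked directly against the database. Below is a minimal sketch, assuming the table and column names that appear in the queries logged above (orthologous_groups, gene_family_id, og_id, ortholog_col_id); the helper name count_ogs is made up for illustration and is not part of Pantagruel.

```python
import sqlite3


def count_ogs(conn, ortholog_col_id):
    """Return (rows, distinct families, distinct family/OG pairs) for one collection."""
    cur = conn.execute(
        "SELECT COUNT(*), "
        "       COUNT(DISTINCT gene_family_id), "
        "       COUNT(DISTINCT gene_family_id || '/' || og_id) "
        "FROM orthologous_groups WHERE ortholog_col_id = ?",
        (ortholog_col_id,),
    )
    return cur.fetchone()


if __name__ == "__main__":
    # Toy in-memory table standing in for the real database (assumed schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orthologous_groups ("
        "replacement_label_or_cds_code TEXT, gene_family_id TEXT, "
        "og_id INTEGER, ortholog_col_id INTEGER)"
    )
    conn.execute(
        "INSERT INTO orthologous_groups VALUES ('ACIDIP_10', 'FAM1', 0, 1)"
    )
    print(count_ogs(conn, 1))
```

To run it against the real data, one would pass sqlite3.connect("03.database/db_sc3") instead of the in-memory connection; a result of all zeros for collection 1 would match the message at step 3.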

So I'm sorry but I'll ask you to run a few more diagnostics if you don't mind:

source /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/environ_pantagruel_db_sc3.sh

cat ${alerec}/reccol

echo ${recs}/${collapsecond}/${replmethod}/${reccol}
ls -lh ${recs}/${collapsecond}/${replmethod}/${reccol}/ | head
ls ${recs}/${collapsecond}/${replmethod}/${reccol}/ | wc -l

sqlite3 -cmd ".headers on" ${sqldb} "SELECT COUNT(*) FROM gene_tree_label2cds_code;"
sqlite3 -cmd ".headers on" ${sqldb} "SELECT * FROM gene_tree_label2cds_code LIMIT 10;"

sqlite3 -cmd ".headers on" ${sqldb} "SELECT COUNT(*) FROM gene_lineage_events;"
sqlite3 -cmd ".headers on" ${sqldb} "SELECT * FROM gene_lineage_events LIMIT 10;"

Also, it would be useful if you could attach the log from step 1, which should be stored here: logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log

thanks a lot for your patience.

Cheers, Florent

Going back to your previous posts, I have the impression that we're still on the same issue as in #39 and that it was not solved then. Can you confirm that you could run the reconciliation task 07 and that you obtained some output? (The ls commands above should verify that.) Please let me know if you have issues running the source command above to load the environment variables for the diagnostic commands.

Hello Florent,

Here's the output I got from running the source commands:

/pantagruel/scripts/pipeline/environ_pantagruel_secondaryvars.sh:62: bad substitution
/pantagruel/scripts/pipeline/pantagruel_pipeline_functions.sh:export:7: invalid option(s)
/pantagruel/scripts/pipeline/pantagruel_pipeline_functions.sh:export:43: invalid option(s)
/pantagruel/scripts/pipeline/pantagruel_pipeline_functions.sh:export:56: invalid option(s)
1   2020-10-10  using ALE software (version v1.0) compiled from source; code origin: https://github.com/ssolo/ALE; code version 265fc4de061f47a4f38c51dc9cfc7a3dda05654e    ale_fullgenetree_dated_1
/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs///
total 4,0K
drwxr-xr-x 3 carlos carlos 4,0K oct 10 15:53 nocollapse
1

And as you said, the log is empty, so it might still have to do with step 07. Last time I ran it, it still did not work properly: I had to run the following commands between steps 06 and 07 to make it work. Might the issue be in there?

source /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/environ_pantagruel_db_sc3.sh
repltasklist=${bayesgenetrees}_${collapsecond}_nexus_list_resume
ptgthreads=1
repllogs=${ptgdb}/logs/replspebypop/replace_species_by_pop_in_gene_trees
replrun=test
export collapsecond='nocollapse'
docker run -u $UID:$UID -v $PWD:$PWD -w $PWD panta python2.7 ${ptgscripts}/replace_species_by_pop_in_gene_trees.py \
  -G ${repltasklist} --no_replace -o ${coltreechains}/${collapsecond} --threads=${ptgthreads} \
  --reuse=0 --verbose=2 --logfile=${repllogs}_${replrun}.log

I had to do so because, for some reason, collapsecond was undefined, which led to the last step of step 06 not running and to step 07 failing.
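A guard against this kind of silent failure could be sketched as below. It only reports which variables are unset or empty before a step is launched. The variable names are the ones used in this thread; the helper missing_vars and the REQUIRED tuple are hypothetical, not part of Pantagruel.

```python
import os

# Names taken from the commands in this thread; adjust to the step being run.
REQUIRED = ("ptgdb", "ptgscripts", "collapsecond", "coltreechains")


def missing_vars(names, env=None):
    """Return the names that are unset or set to an empty string."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(REQUIRED)
    if missing:
        print("undefined or empty variables: " + ", ".join(missing))
```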

OK so it seems that the problem is indeed that you don't have any reconciliation data from step 07, and that probably stems from a deeper problem with the scripts not running fine. Let's get down to the basics:

what is your OS?
what is your shell scripting environment? (it seems that it is not bash, or at least not the one provided by debian/ubuntu, with which I developed the scripts)
importantly: how do you run the pipeline? As a script from your own OS, or within the docker image? If the latter, is it the dockerhub build or your own? I suspect your OS is not appropriate to run pantagruel as a script, and I would recommend you use the dockerhub image instead.
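The "bad substitution" and "invalid option(s)" messages above would be consistent with bash-specific syntax being sourced by a different shell. A minimal sketch for reading a variable the way bash sees it, whatever the login shell is, follows. It assumes bash is installed and on the PATH; read_var_with_bash is a hypothetical helper, not part of Pantagruel.

```python
import subprocess
import tempfile


def read_var_with_bash(env_file, name):
    """Source env_file in a bash subprocess and return the value of variable `name`."""
    if not name.isidentifier():
        raise ValueError("not a plain variable name: %r" % name)
    # "$1" is the environment file; anything printed while sourcing is discarded.
    script = 'source "$1" >/dev/null 2>&1; printf "%s" "${' + name + '}"'
    result = subprocess.run(
        ["bash", "-c", script, "bash", env_file],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Toy environment file standing in for environ_pantagruel_db_sc3.sh.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
        fh.write("export collapsecond='nocollapse'\n")
    print(read_var_with_bash(fh.name, "collapsecond"))
```

An empty string returned for collapsecond after sourcing the real environment file under bash would suggest the variable is genuinely not set by the scripts, rather than lost because of the shell.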

just to rule out stuff, can you please re-run the diagnostic commands; this time I won't make them rely on the environment variables that you can't source (please adapt the paths if I guessed wrong):

cd /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/
ls -lh 07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1/ | head
ls 07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1/ | wc -l

sqlite3 -cmd ".headers on" 03.database/db_sc3 "SELECT COUNT(*) FROM gene_tree_label2cds_code;"
sqlite3 -cmd ".headers on" 03.database/db_sc3 "SELECT * FROM gene_tree_label2cds_code LIMIT 10;"

sqlite3 -cmd ".headers on" 03.database/db_sc3 "SELECT COUNT(*) FROM gene_lineage_events;"
sqlite3 -cmd ".headers on" 03.database/db_sc3 "SELECT * FROM gene_lineage_events LIMIT 10;"

I am running it on Ubuntu 18.04.5 LTS.
You guessed right, I use zsh as my default shell. If that is the issue I will of course switch to bash, but I doubt it is, because of the following.
I ran the pipeline using my own build of the Docker container, simply because I needed to run Interproscan. I have been running the steps you mentioned here using the dockerhub build, to make sure I had the latest version.

This is what I get from the diagnostics

total 94M
-rw-r--r-- 1 carlos carlos  303 oct 10 17:25 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000001-Gtrees.nwk.ale.cons_tree
-rw-r--r-- 1 carlos carlos  64K oct 10 17:25 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000001-Gtrees.nwk.ale.ml_rec
-rw-r--r-- 1 carlos carlos  47K oct 10 17:25 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000001-Gtrees.nwk.ale.Ts
-rw-r--r-- 1 carlos carlos  220 oct 10 16:21 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000002-Gtrees.nwk.ale.cons_tree
-rw-r--r-- 1 carlos carlos  31K oct 10 16:21 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000002-Gtrees.nwk.ale.ml_rec
-rw-r--r-- 1 carlos carlos  188 oct 10 16:21 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000002-Gtrees.nwk.ale.Ts
-rw-r--r-- 1 carlos carlos  220 oct 10 16:28 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000005-Gtrees.nwk.ale.cons_tree
-rw-r--r-- 1 carlos carlos  41K oct 10 16:28 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000005-Gtrees.nwk.ale.ml_rec
-rw-r--r-- 1 carlos carlos  29K oct 10 16:28 core-genome-based_reference_tree_db_sc3.full.lsd.nwk_PANTAGFAMC000005-Gtrees.nwk.ale.Ts
8970

I hope that helps, thank you so much again! Best, Carlos

OK, the good news is that you do have the reconciliations. The annoying news is that there must be something broken in the code if it does not work from the docker image. If you don't mind, I'll ask you for a few more diagnostics (below) so we can pinpoint where the issue is. For these diagnostics, please start a bash session first so that the commands do the expected thing.

cd /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/
ls -lh 07.reconciliations/parsed_recs/ale_fullgenetree_dated_1_parsed_1/

Also, did you get any output from the sqlite commands? If they failed due to the wrong environment, can you please try again from the bash shell? I think there might be an issue in the step that populates the SQLite db with reconciliation information, at the end of task 07.

If you can find the logs from task 07 it would help too.

Another thing: you said that "last time" you had to run the final step of task 06 manually; did you have to do that this time too? Since you have the reconciliation output, I assume that this step was done one way or another, but ideally it should not have to be done manually. Sorry for the mess we're uncovering in this pipeline!

Hello Florent, sorry for the delay in the response, here are the outputs for the latest ls -lh command

total 272K
drwxr-xr-x 2 carlos carlos 256K oct 10 18:14 gene_tree_lineages
drwxr-xr-x 2 carlos carlos 4,0K oct 10 18:12 ref_species_tree
-rw-r--r-- 1 carlos carlos  771 oct 10 22:56 summary_gene_tree_events_minfreq0.1
-rw-r--r-- 1 carlos carlos 3,6K oct 10 22:56 summary_gene_tree_events_minfreq0.1.species_tree_density.pdf
-rw-r--r-- 1 carlos carlos    0 oct 10 22:56 summary_gene_tree_events_minfreq0.25
-rw-r--r-- 1 carlos carlos    0 oct 10 22:56 summary_gene_tree_events_minfreq0.5

And here for the SQLite one (sorry I don't know what happened before but I didn't seem to have copied it):

COUNT(*)
12545
replacement_label_or_cds_code|cds_code
ACIDIP_10|ACIDIP_10
ACIDIP_100|ACIDIP_100
ACIDIP_1000|ACIDIP_1000
ACIDIP_1001|ACIDIP_1001
ACIDIP_1002|ACIDIP_1002
ACIDIP_1003|ACIDIP_1003
ACIDIP_1004|ACIDIP_1004
ACIDIP_1005|ACIDIP_1005
ACIDIP_1006|ACIDIP_1006
ACIDIP_1007|ACIDIP_1007
COUNT(*)
116653
event_id|replacement_label_or_cds_code|freq|reconciliation_id
131|ACIDIP_1226|35|1
150|ACIDIP_1226|23|1
35|ACIDIP_1226|14|1
36|ACIDIP_1226|13|1
168|ACIDIP_1226|10|1
321|ACIDIP_1226|14|1
323|ACIDIP_1226|10|1
324|ACIDIP_1226|20|1
339|ACIDIP_1226|11|1
340|ACIDIP_1226|39|1

As for the final step of task 06: yes, I had to do it manually this time around too; as I said, collapsecond was not assigned for some reason. No worries at all, you have been extremely helpful, and for all we know these issues might have to do with the way I'm running it or the data I am using.

Hi Carlos,

it seems we have a complex issue here, with multiple sources.

I'm still at a loss as to why the environment variable ${collapsecond} is not loaded in task 6, or possibly in the whole pipeline.

can you please attach the environment file of your database (the one passed to option -i)?

Also it seems that the issue with task 08 starts with the inability to generate orthologous gene clusters based on reconciliation information (step 1), even though it seems there is definitely all the input material for that.

I found one possible source of the problem here. Looking closely at the command that was run, the option --use.unreconciled.gene.trees= is fed nothing: the script passed the value of ${mbgenetrees}, which is not defined, where it should have used ${mboutputdir}. I fixed that in bcc10ce; I also introduced some basic verbosity to see what is going on.
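This kind of problem can be spotted mechanically in a logged "# call:" line. The sketch below lists long options that end in "=" with nothing after it, which is what an undefined shell variable leaves behind. The function empty_valued_options is a hypothetical helper for illustration, not part of Pantagruel.

```python
import shlex


def empty_valued_options(command):
    """Return the '--option=' tokens of a command line whose value is empty."""
    return [
        token
        for token in shlex.split(command)
        if token.startswith("--") and token.endswith("=")
    ]


if __name__ == "__main__":
    # Fragment of the call logged at step 1 of task 08.
    call = (
        "python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py "
        "--methods=mixed --use.unreconciled.gene.trees= "
        "--unreconciled.format=nexus --unreconciled.ext=.con.tre"
    )
    print(empty_valued_options(call))
```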

Just to get a better feel for what the script does, can you please run the following command manually and store the STDOUT and STDERR (together, using &> redirection for instance)?

python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=1  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees=/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse --unreconciled.format=nexus --unreconciled.ext=.con.tre --verbose

Once you've updated the code, you can also run pipeline task 08 and see if that fixes some of the issues; this time I would use the option -F so as to have a clean start.

I hope we'll get some better results here! Cheers Florent

Hello Florent, Here's the environment:

#!/bin/bash
## Pantagruel database 'db_sc3'
## built with Pantagruel version '55e275b525a5d4ca40b278ff9523a0f89447468c'; source code available at 'https://github.com/flass/pantagruel'

# the init command with which the config file was created
ptginitcmd='pantagruel -F -d db_sc3 -r panta_out -f PANTAGFAM -I ucfasej@ucl.ac.uk -A ref_seqs -a SC3_organisms_fa init'

# location (folder) of Pantagruel software that was used
export ptgrepo='/pantagruel'
# derive paths to Pantagruel scripts and Python modules
export ptgscripts=${ptgrepo}/scripts
export PYTHONPATH=${ptgrepo}/python_libs
# database parameters (primary variables)
export ptgroot='/home/carlos/Desktop/genomes_archea/panta_out'                            # root folder where to build the database
export ptgdbname='db_sc3'                        # name of dataabse
export ptgversinit='55e275b525a5d4ca40b278ff9523a0f89447468c'                    # current version of Pantagruel software
export myemail='ucfasej@ucl.ac.uk'                            # user identity (better use e-amil address)
export famprefix='PANTAGFAM'                        # gene family prefix
export ncbitax='/home/carlos/Desktop/genomes_archea/panta_out/NCBI/Taxonomy_2020-09-24'                            # folder of up-to-date NCBI Taxonomy database
export ncbiass='/home/carlos/Desktop/genomes_archea/ref_seqs'                            # folder of RefSeq genomes to include in the study
export listncbiass=''                    # list of accessions of RefSeq genomes to include in the study
export customassemb='/home/carlos/Desktop/genomes_archea/SC3_organisms_fa'                  # folder of custom genome assemblies to include in the study
export refass=''                              # folder of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export listrefass=''                      # list of accessions of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export coreseqtype='cds'                    # either 'cds' or 'protein'
export pseudocoremingenomes=    # the minimum number of genomes in which a gene family should be present to be included in the pseudo-core genome gene set
export userreftree=''                    # possible user-provided reference tree
export poplgthresh='default'                    # parameter to define populations of genomes in the reference tree (stem branch length threshold, default value depends on coreseqtype)
export poplgleafmul='1.5'                  # parameter to define populations of genomes in the reference tree (multiplier to the former in case it is a leaf, default 1.5)
export popbsthresh='80'                    # parameter to define populations of genomes in the reference tree (stem branch support threshold, default 80)
export rootingmethod='treebalance'                # rooting method for core-genome tree
export snpali=''                              # restrict core-genome alignment to SNPs
export chaintype='fullgenetree'                        # whether gene trees will be collapsed ('collapsed', if -c option enabled) or not ('fullgenetree', default)
export genefamlist=''                    # list of gene families for which computation of gene trees and all subsequent analyses will be restricted (default: no restriction)
export preferredgenomes=''          # list of genome codes to use as preferred representative in the listing of genes in clade-specific gene lists (default: none)
export recmethod='ALE'                        # genetree/species tree reconciliation method: 'ALE' or 'ecceTERA'
# non-default parameters for gene trees collapsing derived from -C option value (passed to init script via ${collapseCladeParams}):
export cladesupp=70                          # - clade criterion trheshold (int)
export subcladesupp=35                    # - wihtin-clade criterion trheshold (int)
export criterion='bs'                        # - criterion (branch support: 'bs', branch length 'lg')
export withinfun='median'                        # - aggregate function for testing within the clade ('min', 'max', 'mean', 'median')
export hpcremoteptgroot='none'          # if not empty nor 'none', will use this server address to send data and scripts to run heavy computions there
export maxreftreeheight='0.5'          # restict events younger than that age (comprised in [0.0; 1.0]) on the species tree for gene co-evolution scoring
export updatedbfrom=''                  # the current pantagruel database is an update from that found at this path
export customstraininfo=''          # optional custom strain information file
export pathtoipscan=''                  # optional path to interproscan executable
## other parameters have default values defined in the generic source file environ_pantagruel_defaults.sh
source ${ptgscripts}/pipeline/environ_pantagruel_defaults.sh
## these defalts can be overriden by uncommenting the relevant line below and editing the variable's value
## or (recomended for changes to last past calls to `pantagruel --refresh init`):
## create a file '${ptgroot}/${ptgdbname}/user_environ_pantagruel_${ptgdbname}.sh' containing the `export variable=value` commands
# default values are:
# Prokka annotation parameters (only relevant if custom genome assemblies are provided):
#~ export assembler="somesoftware"
#~ export seqcentre="somewhere"
#~ export refgenus="Reference"
# species tree inference parameters
#~ export ncorebootstrap=200
# gene tree inference parameters
#~ export mainresulttag='rootedTree'
# gene trees collapsing DEFAULT values (used when -C option is NOT present in init call)
#~ export cladesuppdef=70
#~ export subcladesuppdef=35
#~ export criteriondef='bs'
#~ export withinfundef='median'
# gene tree/species tree reconciliation inference parameters
#~ export ALEalgo='ALEml'
#~ export ecceTERAalgo='amalgamate'
#~ export recsamplesize=1000
# gene tree/species tree reconciliation parsing parameters for co-evolution analysis
#~ export evtypeparse='ST'
#~ export minevfreqparse=0.1
#~ export minevfreqmatch=0.5
#~ export minjoinevfreqmatch=1.0
#~ export maxreftreeheight=0.25
userparams="${ptgroot}/${ptgdbname}/user_environ_pantagruel_${ptgdbname}.sh"
if [ -s "${userparams}" ] ; then
  echo "Warning: will use user-defined values for Pantagruel environment variables, as deined in '${userparams}':"
  cat ${userparams}
  source ${userparams}
fi

# secondary vars are defined based on the above
source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh
# load shared functions
source ${ptgscripts}/pipeline/pantagruel_pipeline_functions.sh

And the stdout for that command, I might have missed something silly here:

Traceback (most recent call last):
  File "/pantagruel/scripts/get_orthologues_from_ALE_recs.py", line 325, in <module>
    'skip.reconciled', 'threads=', 'verbose=', 'help'])
  File "/usr/lib/python2.7/getopt.py", line 88, in getopt
    opts, args = do_longs(opts, args[0][2:], longopts, args[1:])
  File "/usr/lib/python2.7/getopt.py", line 156, in do_longs
    raise GetoptError('option --%s requires argument' % opt, opt)
getopt.GetoptError: option --verbose requires argument

Re-running step 08 with the new version and -R I get a new error:

This is Pantagruel pipeline version f86442e113e2bed426db3bd57ebeb94f4de15069 using source code from repository '/pantagruel'

will try and resume computation of task where it was last stopped
# will run tasks: 8
[2020-10-22 19:08:32] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
generating ortholog collection from reconciled gene trees
# call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees= --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
first delete previous records for this ortholog collection ('ortholog_collection_1') in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
step 2.0: completed importing ortholog collection record into database
step 2.1: completed importing ortholog classification into database for reconciled gene trees
step 2.2: completed importing ortholog classification into database for unreconciled gene trees

step 3: generating abs/pres matrix
ortholog_collection_1
Traceback (most recent call last):
  File "/pantagruel/scripts/get_ortholog_presenceabsence_matrix_from_sqlitedb.py", line 24, in <module>
    orfanfam = int(sys.argv[4])
ValueError: invalid literal for int() with base 10: 'PANTAGFAMC000000'
ERROR: step 3: failed generating abs/pres matrix
ERROR: Pantagruel pipeline task 8: failed.

Hi Carlos, thanks for that.

Your environment file seems fine to me. My suspicion was that the $chaintype variable was not set to the right value, but it is correctly set as chaintype='fullgenetree'. So I still cannot see why collapsecond was not set correctly when executing pipeline task 06 and later. I do think the issue is that your OS is struggling to source the secondary scripts that define all the task-specific environment variables (the source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh line in the main environment file). I made an attempt to fix that in eb7edfc by changing the shebangs of all pipeline scripts from #!/bin/bash to the more robust #!/usr/bin/env bash; hopefully that will let your system read and execute these files properly.
My bad about the get_orthologues_from_ALE_recs.py command; to get the verbose output you need to use either option -v or --verbose=1.
Also, I'm really sorry, but I had not pushed the last changes I made 😅 before asking you to re-run the pipeline. The fix for the variable ${mboutputdir} is included in bcc10ce; that may save the day (or not).
So can you please run pipeline task 08 again (sorry 😬) with the latest version of master? I suggest you use pantagruel -i db_sc3/environ_pantagruel_db_sc3.sh -F 08 to have a clean start.
Once you have done that, it would be helpful to see what has been sent to this log file: /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log

Cheers Florent

And for your last error: that was a bug I introduced in my last modification of get_ortholog_presenceabsence_matrix_from_sqlitedb.py; this is fixed in 4325b5a.
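
For reference, the "option --verbose requires argument" error quoted earlier in the thread is standard getopt behaviour: a long option declared with a trailing '=' must be given a value. Below is a minimal sketch in Python 3 (the pipeline script itself runs under python2.7). The short option string "v" and the reduced option list are assumptions for illustration; only 'verbose=' and 'help' are visible in the traceback.

```python
import getopt

# 'verbose=' (with a trailing '=') declares that --verbose takes a mandatory value.
# The short option string "v" is an assumption; the real script's string is not shown.
LONGOPTS = ["verbose=", "help"]


def parse(argv):
    """Return the (option, value) pairs that getopt extracts from argv."""
    opts, _args = getopt.getopt(argv, "v", LONGOPTS)
    return opts


print(parse(["--verbose=1"]))   # long form with an explicit value
print(parse(["-v"]))            # short form, no value needed

try:
    parse(["--verbose"])        # long form without a value
except getopt.GetoptError as err:
    print("GetoptError:", err)
```

This is why both -v and --verbose=1 are accepted while a bare --verbose is rejected.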

Hello Florent, I have attached the output for the command you mentioned with the correct -v. Step 08 with the latest -F fails with the error:

This is Pantagruel pipeline version c76ac373e882f1ea4739dbc54987008a53257fdd using source code from repository '/pantagruel' (branch: 'master')

# will run tasks: 8
[2020-10-24 12:56:32] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; FORCE mode is on: ERASE and recreate the folder to write new result in its place
generating ortholog collection from reconciled gene trees
# call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees=/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
Error: UNIQUE constraint failed: ortholog_collections.ortholog_col_id
ERROR: step 2.0: failed when importing ortholog collection record into database
ERROR: Pantagruel pipeline task 8: failed.

Hence why I ran it with -R, but there I still get an error (although further down the line!):

This is Pantagruel pipeline version c76ac373e882f1ea4739dbc54987008a53257fdd using source code from repository '/pantagruel' (branch: 'master')

will try and resume computation of task where it was last stopped
# will run tasks: 8
[2020-10-24 12:59:09] Pantagruel pipeline task 8: classify genes into orthologous groups (OGs) and search clade-specific OGs.
Task folder '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs' already exists; -R|--resume option was used so Pantagruel will atempt to resume from an interupted previous run
generating ortholog collection from reconciled gene trees
# call: python2.7 /pantagruel/scripts/get_orthologues_from_ALE_recs.py -i /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/07.reconciliations/fullgenetree_ALE_recs/nocollapse/noreplace/ale_fullgenetree_dated_1 -o /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1 --threads=4  --ale.model=dated --methods=mixed --max.frac.extra.spe=0.5 --majrule.combine=0.5 --colour.combined.tree --use.unreconciled.gene.trees=/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/06.gene_trees/fullgenetree_mrbayes_trees/nocollapse --unreconciled.format=nexus --unreconciled.ext=.con.tre &> /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/get_orthologues_from_ALE_recs_ortholog_collection_1.log
step 1: complete generating ortholog collection from reconciled gene trees

importing ortholog classification into database
first delete previous records for this ortholog collection ('ortholog_collection_1') in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
step 2.0: completed importing ortholog collection record into database
step 2.1: completed importing ortholog classification into database for reconciled gene trees
step 2.2: completed importing ortholog classification into database for unreconciled gene trees

step 3: generating abs/pres matrix
ortholog_collection_1
building matrix of gene presence / absence for 9 genomes
examining a total of 14585 CDSs with non-ORFan family assignment
retrieveing orthology classification from collection: ortholog_col_id=1
2293 families not covered by orthology classification (means no evolution scenario was inferred for these families)
0 families covered by orthology classification into a total of 0 orthologous groups
these totalize 31 families with unique representative in the dataset (singletons) and 2262 others [total: 2293]
step 3: completed generating abs/pres matrix

listing clade-specific orthologs
step 4: completed listing clade-specific orthologs

null device 
          1 
Found 52432 functional annotation records linked to GO terms in the database '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/03.database/db_sc3'
Will now run GO term enrichment tests
step 5.1: generating core genome background term distribution for clades
step 5.0: generating core genome background term distribution for the whole dataset (repr.: 'ACIDIP'; size: 9)
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_coregenome_terms.tab_nonull
step 5.1: generating core genome background term distribution for each clade in the tree based on ortholog collection 1
clade0  (repr.: 'CUNDIV1'; size: 3) 'CUNDIV1','CUNDIV2','GPL37'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_coregenome_terms.tab_nonull
clade1  (repr.: 'ACIDIP'; size: 6) 'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_coregenome_terms.tab_nonull
clade2  (repr.: 'CUNDIV1'; size: 2) 'CUNDIV1','CUNDIV2'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_coregenome_terms.tab_nonull
clade3  (repr.: 'FAM36'; size: 5) 'FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_coregenome_terms.tab_nonull
clade4  (repr.: 'FAM36'; size: 4) 'FAM36','FAR37','FERACI1','FERACI2'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_coregenome_terms.tab_nonull
clade5  (repr.: 'FAM36'; size: 3) 'FAM36','FAR37','FERACI2'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_coregenome_terms.tab_nonull
clade6  (repr.: 'FAM36'; size: 2) 'FAM36','FAR37'
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab
-rw-r--r-- 1 1000 1000 0 Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_coregenome_terms.tab_nonull
step 5.1: completed generating core genome background term distribution for each clade in the tree based on ortholog collection 1

step 5.2: generating pangenome background term distribution for clades
-rw-r--r-- 1 1000 1000 764K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 562K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/9-genomes_pangenome_terms.tab_nonul
clade0  'CUNDIV1','CUNDIV2','GPL37'
-rw-r--r-- 1 1000 1000 250K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 183K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade0_pangenome_terms.tab_nonull
clade1  'ACIDIP','FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 515K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 380K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade1_pangenome_terms.tab_nonull
clade2  'CUNDIV1','CUNDIV2'
-rw-r--r-- 1 1000 1000 172K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 126K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade2_pangenome_terms.tab_nonull
clade3  'FAM36','FAR37','FERACI1','FERACI2','FTT37'
-rw-r--r-- 1 1000 1000 424K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 313K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade3_pangenome_terms.tab_nonull
clade4  'FAM36','FAR37','FERACI1','FERACI2'
-rw-r--r-- 1 1000 1000 349K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 258K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade4_pangenome_terms.tab_nonull
clade5  'FAM36','FAR37','FERACI2'
-rw-r--r-- 1 1000 1000 254K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 187K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade5_pangenome_terms.tab_nonull
clade6  'FAM36','FAR37'
-rw-r--r-- 1 1000 1000 168K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab
-rw-r--r-- 1 1000 1000 122K Oct 24 13:01 /home/carlos/Desktop/genomes_archea/panta_out/db_sc3/08.orthologs/ortholog_collection_1/GeneOntology/clade_go_term_reference_sets/clade6_pangenome_terms.tab_nonull
step 5.2: completed 

step 6: comparing each clade-specific core genome to its respective core genome
ERROR: step 6: failed comparing each clade-specific core genome to its respective core genome; check specific logs in '/home/carlos/Desktop/genomes_archea/panta_out/db_sc3/logs/GOterm_enrichment/cladespecific_vs_coregenome_genes*' for more details
ERROR: Pantagruel pipeline task 8: failed.

I hope that helps! command_stdout.txt

Hi Carlos, thank you for your detailed feedback. Here is what I can gather from it:

The output of the get_orthologues_from_ALE_recs.py -v command confirms everything is done correctly at step 1.
When you run the full pipeline, the step 3 output shows that the set of gene families covered by the orthology classification, as loaded into the SQLite db at step 2, is still empty.
The core genome background term distributions are still empty at step 5.

I fixed some of the minor bugs that were revealed by your last commands:

An error in get_orthologues_from_ALE_recs.py when using -v (there was a bug in a rarely used print line of code);
I made the pipeline clean the database when using option -F, as it does when using -R, so you will be able to repeat a task with -F when the database has been edited (for task 8, -R and -F now do exactly the same thing).

But largely I'm still at a loss as to what is going on. I think that at this stage the most efficient way forward would be for you to share some of your data with me so I can run tests and work through it. If you don't mind sharing those (privately!), I would ask you first for the SQLite database file.

Best Florent

Hello Florent, Yes, I have checked, and sharing the SQLite database should not be an issue. What would be the best way to share it privately? Best, Carlos

Hi Carlos,

Thank you for sharing your database. Thanks to that, I think I have put my finger on the main problem, the one preventing the generation of the clade's core genome GO term background distribution files and thus preventing the testing of GO term enrichment. It was due to a wrong assumption on my part that an SQL INNER JOIN can match rows on NULL values; it can't (see https://stackoverflow.com/questions/2123006/inner-join-on-null-value/2126023). That's fixed in commit 983bbe0.
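
The NULL behaviour can be reproduced with a small in-memory SQLite database. This is only a sketch: the table and column names below are hypothetical (they are not Pantagruel's schema), and the null-safe `IS` comparison is one possible workaround in SQLite, not necessarily what commit 983bbe0 does.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical toy tables: one gene has an orthologous group id, the other has NULL.
con.executescript("""
CREATE TABLE gene_calls (gene TEXT, og_id INTEGER);
CREATE TABLE og_labels (og_id INTEGER, label TEXT);
INSERT INTO gene_calls VALUES ('geneA', 1), ('geneB', NULL);
INSERT INTO og_labels VALUES (1, 'og1'), (NULL, 'unclassified');
""")

# '=' never evaluates to true when either side is NULL, so geneB is dropped.
eq_rows = con.execute(
    "SELECT g.gene, o.label FROM gene_calls g "
    "INNER JOIN og_labels o ON g.og_id = o.og_id ORDER BY g.gene"
).fetchall()

# SQLite's IS operator treats NULL IS NULL as true, so geneB is kept.
is_rows = con.execute(
    "SELECT g.gene, o.label FROM gene_calls g "
    "INNER JOIN og_labels o ON g.og_id IS o.og_id ORDER BY g.gene"
).fetchall()

print(eq_rows)  # [('geneA', 'og1')]
print(is_rows)  # [('geneA', 'og1'), ('geneB', 'unclassified')]
```

With only NULL values in the join column, as in the shared database, the plain `=` join returns no rows at all, which would be consistent with the empty background files.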

However, the reason I never really noticed this issue before is that there should not be only NULL values in the og_id field of the orthologous_groups table; but that's what you have. It means that the whole benefit of running the gene tree reconciliations to define orthologous gene groups is lost, as the information is not loaded into the database.
That must happen either during the parsing of the ALE output (step 1) by the script get_orthologues_from_ALE_recs.py, or during the loading of the data in the SQLite db (step 2). To know what's going on at step 1, we can look at the output that's supposed to be dumped into db_sc3/08.orthologs/ortholog_collection_1/mixed/. You should have a large set of files, 3 per gene family, something like:

PANTAGFAMC001671_mixed.orthologs.majrule_combined_0.500000
PANTAGFAMC001671_mixed.orthologs.majrule_combined_0.500000.nex
PANTAGFAMC001671_mixed.orthologs.majrule_combined_0.500000.pickle

If these are present and not empty, there is hope it's just a glitch in the database loading (step 2). If there are such files, can you please forward one set of three files (relating to one gene family) so I can have a look? If they are absent or empty, it's a bigger issue in the parsing (step 1).
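
The check described above can be scripted. The following is a rough sketch, not part of Pantagruel: the function name `check_triplets` is made up for illustration, and the file-name tag is taken from the example file names listed above.

```python
import os

TAG = "_mixed.orthologs.majrule_combined_0.500000"
SUFFIXES = ("", ".nex", ".pickle")  # the three files expected per gene family


def check_triplets(outdir, tag=TAG):
    """Map each family prefix found in outdir to its missing or empty suffixes."""
    found = {}
    for fname in sorted(os.listdir(outdir)):
        if tag not in fname:
            continue  # not an ortholog result file
        family, suffix = fname.split(tag, 1)
        ok = found.setdefault(family, set())
        # Count a file only if it has an expected suffix and is not empty.
        if suffix in SUFFIXES and os.path.getsize(os.path.join(outdir, fname)) > 0:
            ok.add(suffix)
    return {fam: [s for s in SUFFIXES if s not in ok] for fam, ok in found.items()}
```

Calling `check_triplets("db_sc3/08.orthologs/ortholog_collection_1/mixed")` returns one key per family prefix; an empty list means all three files are present and non-empty, and the number of keys should be close to the number of gene families analysed with ALE.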

Cheers Florent

Hi Carlos,

any chance you had a look at the above? I'd like to help you solve that dreadfully long-standing issue. Please share the indicated ortholog files when you can.

Best, florent

Hello Florent,

Sorry for the long silence, I have been tangled up with quite a lot of New Year deadlines. The files are there and not empty, so that seems to be good news? Please find these files attached; I had to add the 'txt' and 'gz' suffixes because otherwise GitHub would not let me upload them. Many thanks!

Best, Carlos

core_mixed.orthologs.majrule_combined_0.500000.nex.txt core_mixed.orthologs.majrule_combined_0.500000.pickle.gz core_mixed.orthologs.majrule_combined_0.500000.txt

Hi Carlos, happy New Year! Thank you for the files, and no problem about the delay. I understand the situation, I'm in it too! It's good news that you have those files (I assume many such triplets; you should expect one per gene family that has been analysed with ALE). I'll try and look into this as soon as I have the time. Cheers,

Florent

Hi Carlos, I finally got it! It's due to a change in ALE's output file names since its version 0.5: the file is now named "name-of-the-species-tree-file_name-of-gene-tree-file.ale.ml_rec" instead of "name-of-gene-tree-file.ale.ml_rec" as it used to be (cf. change in task 07 script ce7bfa3). Unfortunately, this was not reflected in task 08 and the ALE output parser get_orthologues_from_ALE_recs.py, in which the gene family id is extracted from the ALE output file name. As a result, the gene family id would always be 'core', notably leading the script to write all gene family results to the same file, overwriting it each time. That explains why in your case the folder db_sc3/08.orthologs/ortholog_collection_1/mixed/ only contains the three files you attached:

core_mixed.orthologs.majrule_combined_0.500000
core_mixed.orthologs.majrule_combined_0.500000.nex
core_mixed.orthologs.majrule_combined_0.500000.pickle

instead of one such triplet per gene family, for example:

PANTAGFAMC000001_mixed.orthologs.majrule_combined_0.500000
PANTAGFAMC000001_mixed.orthologs.majrule_combined_0.500000.nex
PANTAGFAMC000001_mixed.orthologs.majrule_combined_0.500000.pickle
PANTAGFAMC000002_mixed.orthologs.majrule_combined_0.500000
PANTAGFAMC000002_mixed.orthologs.majrule_combined_0.500000.nex
PANTAGFAMC000002_mixed.orthologs.majrule_combined_0.500000.pickle
PANTAGFAMC000005_mixed.orthologs.majrule_combined_0.500000
PANTAGFAMC000005_mixed.orthologs.majrule_combined_0.500000.nex
PANTAGFAMC000005_mixed.orthologs.majrule_combined_0.500000.pickle
...

The parser now takes into account the ALE version tag so that it knows how to correctly deduce the gene family ids from the ALE output file names (changed in 5ee7a26).
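
To illustrate the naming problem, here is a minimal sketch of extracting a family id under both naming schemes. It is not the code from 5ee7a26: the function `family_id`, the idea of passing the species tree file name explicitly (rather than reading the ALE version tag), the example species tree name 'core_tree.nwk', and the assumption that the id is the leading token of the gene tree file name are all hypothetical.

```python
import os
import re


def family_id(rec_path, species_tree_file=None):
    """Return the leading token of an ALE reconciliation file name.

    species_tree_file: name of the species tree file that newer ALE versions
    prepend (followed by '_') to the output name; None for the older naming.
    """
    name = os.path.basename(rec_path)
    if species_tree_file is not None:
        prefix = os.path.basename(species_tree_file) + "_"
        if name.startswith(prefix):
            name = name[len(prefix):]  # drop the species tree prefix
    # Assumed convention: the family id is the first token before '.', '_' or '-'.
    return re.split(r"[._-]", name)[0]


# Older naming: the gene tree file name comes first.
print(family_id("PANTAGFAMC000001.ale.ml_rec"))
# Newer naming, parsed the old way: the species tree token is returned instead.
print(family_id("core_tree.nwk_PANTAGFAMC000001.ale.ml_rec"))
# Newer naming, with the species tree prefix stripped first.
print(family_id("core_tree.nwk_PANTAGFAMC000001.ale.ml_rec", "core_tree.nwk"))
```

Under these assumptions the second call returns 'core' for every family, which is the kind of collision that makes all results land in the same output files.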

I'll let you try running the whole task 08 again to see if it works now. Make sure to use one of the -R or -F flags so that the database is cleaned of previous iterations.

Thanks again for your patience and for helping me maintain this software!

Cheers, Florent

NB: I just realised that a version 1.0 of ALE was released just last month, and that's what you were using! It seems the core of the ALE code and the printed version tag were actually changed in May 2020 (https://github.com/ssolo/ALE/commit/265fc4de061f47a4f38c51dc9cfc7a3dda05654e).

Hi Carlos, have you had the chance to try the fix above? If that works for you, I'd like to make this a small release (the first!) just to put some stable code out there from the master branch, at last, while I'm working on finishing the usingGeneRax branch (aimed at long-term support).

flass / pantagruel

ERROR: step 6: failed comparing each clade-specific core genome #45