Closed: @ChemaMD closed this issue 5 years ago
Dear @ChemaMD, many thanks for reporting this bug. I will have a look at it now.
Kind regards
Dear @ChemaMD , I should have fixed it in the latest push (68fa321). It was a very silly and easy-to-solve bug, so many thanks for reporting it!
Please let me know whether the latest pull will indeed fix it for you.
Kind regards
Thank you very much for such a quick response and solution! I will update and re-run.
Dear @lucventurini , the update solved the issue! Thank you very much for your help.
I'm now running Mikado pick, and it seems to run for a long time without actually writing anything to any of the output files. The log gives some warnings; it appears as if the pipeline started checking the database but then went silent. I tried to run the same command with the flag --procs 5, in all cases allocating ~30-40 GB of memory in the cluster, and the log is similar. Is this normal?
Thank you for your help.
====== Pick log =====
2019-10-08 15:50:59,944 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
2019-10-08 15:50:59,946 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 1 --json-conf configuration.yaml --subloci-out mikado.subloci.gff3
2019-10-08 15:50:59,947 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-08 15:51:00,131 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-08 15:51:02,723 - Scaffold01:21107-135422 - locus.py:1286 - INFO - expand_transcript - MainProcess - gmap1_TRINITY_DN103619_c1_g2_i21.path1 has now start 93847, end 132507.
2019-10-08 15:51:02,727 - Scaffold01:21107-135422 - locus.py:195 - WARNING - finalize_alternative_splicing - MainProcess - gmap1_TRINITY_DN103619_c1_g2_i21.path1 is now padded
Dear @ChemaMD, mikado pick should be quite fast, so if it is not writing anything down there is something that warrants investigation.
Would you please be able to send me the log file? Attached here would be fine, if it is small enough.
Best
I just launched the following command:
mikado pick --json-conf configuration.yaml --subloci-out mikado.subloci.gff3 --shm --no-pad --log pick.log
I added the --no-pad flag as in previous attempts there was a WARNING regarding the padding of one transcript (see previous message).
Below is the log this run produced (it is still running but not writing anything down). I consistently get a WARNING message with setup_logger - MainProcess - Current level for queue: INFO. The genome has 30 scaffolds, and after finishing Scaffold30 it keeps running, but not outputting anything.
======= PICK LOG ======
2019-10-08 16:57:36,411 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
2019-10-08 16:57:36,411 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --json-conf configuration.yaml --subloci-out mikado.subloci.gff3 --shm --no-pad --log pick.log
2019-10-08 16:57:36,411 - main_logger - picker.py:274 - INFO - setup_shm_db - MainProcess - Copying Mikado database into a SHM db
2019-10-08 16:57:37,781 - main_logger - picker.py:293 - INFO - setup_shm_db - MainProcess - DB copied into memory
2019-10-08 16:57:37,781 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-08 16:57:37,909 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-08 16:57:38,162 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold01
2019-10-08 16:57:47,529 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold01
2019-10-08 16:57:47,531 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold02
2019-10-08 16:57:54,099 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold02
2019-10-08 16:57:54,102 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold03
2019-10-08 16:58:00,794 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold03
2019-10-08 16:58:00,795 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold04
2019-10-08 16:58:05,983 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold04
2019-10-08 16:58:05,985 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold05
2019-10-08 16:58:11,523 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold05
2019-10-08 16:58:12,024 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold06
2019-10-08 16:58:17,151 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold06
2019-10-08 16:58:17,616 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold07
2019-10-08 16:58:23,040 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold07
2019-10-08 16:58:23,040 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold08
2019-10-08 16:58:27,665 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold08
2019-10-08 16:58:27,665 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold09
2019-10-08 16:58:32,882 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold09
2019-10-08 16:58:32,884 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold10
2019-10-08 16:58:38,165 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold10
2019-10-08 16:58:38,165 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold11
2019-10-08 16:58:42,873 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold11
2019-10-08 16:58:42,874 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold12
2019-10-08 16:58:46,108 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold12
2019-10-08 16:58:46,111 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold13
2019-10-08 16:58:49,454 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold13
2019-10-08 16:58:49,454 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold14
2019-10-08 16:58:53,142 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold14
2019-10-08 16:58:53,142 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold15
2019-10-08 16:58:56,390 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold15
2019-10-08 16:58:56,390 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold16
2019-10-08 16:58:59,380 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold16
2019-10-08 16:58:59,382 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold17
2019-10-08 16:59:01,583 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold17
2019-10-08 16:59:01,584 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold18
2019-10-08 16:59:04,192 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold18
2019-10-08 16:59:04,192 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold19
2019-10-08 16:59:05,687 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold19
2019-10-08 16:59:05,687 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold20
2019-10-08 16:59:07,255 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold20
2019-10-08 16:59:07,255 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold21
2019-10-08 16:59:09,710 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold21
2019-10-08 16:59:09,710 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold22
2019-10-08 16:59:11,373 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold22
2019-10-08 16:59:11,373 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold23
2019-10-08 16:59:12,568 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold23
2019-10-08 16:59:12,568 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold24
2019-10-08 16:59:13,574 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold24
2019-10-08 16:59:13,575 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold25
2019-10-08 16:59:14,387 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold25
2019-10-08 16:59:14,387 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold26
2019-10-08 16:59:14,982 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold26
2019-10-08 16:59:14,982 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold27
2019-10-08 16:59:15,156 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold27
2019-10-08 16:59:15,156 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold28
2019-10-08 16:59:15,379 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold28
2019-10-08 16:59:15,379 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold29
2019-10-08 16:59:15,423 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold29
2019-10-08 16:59:15,427 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold30
2019-10-08 16:59:16,823 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold30
Dear @ChemaMD ,
> I added the --no-pad flag as in previous attempts there was a WARNING regarding the padding of one transcript (see previous message).
My fault, the message should really be at INFO level. It is literally logging which transcripts have been padded, nothing more.
> Below the log this run made (it is still running but not writing anything down). I consistently get a WARNING message with setup_logger - MainProcess - Current level for queue: INFO. The genome has 30 scaffolds, and after finishing chromosome 30 it keeps running, but not outputting anything.
In the output folder there should be a temporary folder (mikado_pick_tmp*), with temporary files being written into it. If that is the case, things are proceeding as they should.
The parsing message relates to the parsing of the file, not to the actual analysis. I think I will change the behaviour so that it reports when the analysis of a given chromosome is actually finished.
Dear @lucventurini ,
The command did not generate the temporary folder, and it is still running (it's been running for 5 h) without writing anything to the mikado.loci.* and mikado.subloci.* files, which have their headers but are otherwise empty. The temporary folder only appears when I run mikado pick with more than two processors; inside that folder there are several files, also empty. To me it looks as if mikado pick prepares everything but then does not start picking transcripts. I think the configuration.yaml file is correct (attached) and points to all the correct files.
Below, the log of the commands I ran:
mikado configure -t 6 --list Owenia_mikadoInput_list --reference Owenia_softmasked_v072019.fa --mode permissive --scoring human.yaml --copy-scoring vertebrates.yaml --junctions ./mikado_input/portcullis_filtered.pass.junctions.bed -bt /data/SBCS-MartinDuranLab/00-BlastDBs/Uniprot_SwissProt_160719 configuration.yaml
mikado prepare --json-conf configuration.yaml
mikado serialise --procs 1 --json-conf configuration.yaml --xml mikado.diamond.xml --orfs mikado_prepared.fasta.transdecoder.bed --blast_targets /data/SBCS-MartinDuranLab/00-BlastDBs/Uniprot_SwissProt_160719.fasta --transcripts mikado_prepared.fasta --junctions /data/scratch/btx333/10-Mikado/mikado_input/portcullis_filtered.pass.junctions.bed
mikado pick --json-conf configuration.yaml --subloci-out mikado.subloci.gff3 --shm --no-pad --log pick.log
Dear @ChemaMD, it is really strange. Mikado pick should be writing to the files inside the temporary directory almost immediately. Would you please be able to do the following:
1- create a small subset of mikado_prepared.gtf (say, the first 10,000 lines);
2- launch mikado pick on that file, adding as well the -lv DEBUG switch to the command.
This will force mikado to go into a single process and should finish really quickly. The log will have a lot of information we can use to understand what is going on - especially if indeed mikado will still not finish even with a reduced GTF.
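As an aside, such a test subset can be built along these lines (a sketch only; the file names follow this thread, and the toy input below merely stands in for the real ~9.2M-line file):

```shell
# Toy stand-in for mikado_prepared.gtf (the real file is ~9.2M lines).
{
  printf 'chr1\tsrc\texon\t1\t100\t.\t+\t.\ttranscript_id "t1";\n'
  printf 'chr1\tsrc\texon\t200\t300\t.\t+\t.\ttranscript_id "t1";\n'
  printf 'chr1\tsrc\texon\t500\t600\t.\t+\t.\ttranscript_id "t2";\n'
  printf 'chr1\tsrc\texon\t700\t800\t.\t+\t.\ttranscript_id "t2";\n'
} > mikado_prepared.gtf

# In practice this would be: head -n 10000 mikado_prepared.gtf > mikado_test.gtf
head -n 3 mikado_prepared.gtf > mikado_test.gtf   # 3 lines here, for the toy file

# head cuts mid-transcript, so drop every line carrying the last
# transcript_id and end the subset on a complete model.
last_tid=$(tail -n 1 mikado_test.gtf | grep -o 'transcript_id "[^"]*"')
grep -vF "$last_tid" mikado_test.gtf > mikado_test.fixed.gtf
```

The fixed file then only contains complete transcripts and can be fed to mikado pick.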
Hi @lucventurini , it now worked well with just a 10,000-line GTF. I attached pick_test.log. The original GTF file is 9,212,560 lines.
pick_test.log
Dear @ChemaMD , OK this is progress! May I ask you then to:
1- fix the subset file so that it ends on a complete transcript (having been created with the head command, it is truncated, hence crashing mikado - this is the expected behaviour);
2- relaunch mikado pick on it with multiple processes, without the -lv DEBUG switch.
If it functions, then we should try with a whole scaffold/chromosome (probably the smallest would be the safest bet), again starting with the -lv DEBUG switch.
Also, I did modify the code of Mikado (see 36567b0 for details, but please download 9da88d4) so that the log now reports when a locus has actually been fully analysed while in multiprocessing mode. In single-processing mode, the messages were already informative (i.e. when it says it has finished a chromosome, it has finished analysing it).
When launching with multiple processes, inside the temporary folder there should be a temp_store.db file (where the transcripts to be processed are stored) and multiple output_*.db files (where each child process writes out its results before the final merge). These are all SQLite3 databases. All of them should be populated fairly quickly during the run, so if they are not, something strange is going on.
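As a quick way to check whether those files are being populated (a sketch only; the folder and file names are taken from this thread, and stat -c assumes GNU coreutils):

```shell
# Print "bytes  path" for every SQLite file under the given temporary folder;
# calling this repeatedly (e.g. via watch) shows whether pick is still writing.
db_sizes() {
    # stat -c is GNU coreutils; on macOS/BSD use: stat -f '%z %N'
    stat -c '%s  %n' "$1"/*.db* 2>/dev/null
}

# Example (folder name as reported in this thread):
#   db_sizes mikado_pick_tmp*
```

If the reported sizes never grow between calls, the run has most likely stalled.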
I am honestly still unclear on what is going on, but hopefully through this debugging we can solve the issue.
Hi @lucventurini , I just tried with multiple processors and the small fixed GTF, and it worked well. It created the temporary folder. It also worked well with just the smallest chromosome, and also when adding the --subloci-out flag (BTW, I think in previous versions and in the documentation the flag was --subloci_out, with an underscore). I will launch it now with the entire dataset and see what happens.
mikado pick --procs 3 --json-conf configuration.yaml mikado_test.gtf > pick_test2.log 2>&1
mikado pick --procs 3 --json-conf configuration.yaml mikado_prepared.Scaffold30.gtf > pick_test3.log 2>&1
> Hi @lucventurini , I just tried with the multiprocessors and small fixed GTF and worked well. It created the temporary file. It also worked well with just the smallest chromosome.
Excellent. Hopefully the new logging will give you a sense of how it is proceeding with the full dataset. Thank you for nudging me into making those messages more informative!
> BTW, I think in previous versions and in the documentation, the flag was --subloci_out with underscore
Yes, well spotted - I changed it recently. Documentation is being updated on a dedicated branch; once this issue and #226 are fixed, it is one of the few remaining roadblocks to finally having an official Mikado 2.
Dear @lucventurini , I launched mikado pick with the entire dataset in DEBUG mode and everything seemed to go as planned (I stopped it after five hours, since I estimated it would take around 7 days in DEBUG mode to parse the entire mikado_prepared.gtf file). I then relaunched the job without the -lv DEBUG flag, this time with 10 CPUs and 1 GB of RAM allocated per CPU. The run was launched at 16:55; it generated the mikado_pick_tmp* folder and created three tmp_store.db* files and ten output-[1-10].db files. mikado pick wrote into these files at 16:57, so barely after the job started, but nothing else after that (though apparently the job is still running in the cluster). The log stopped being produced at exactly that time (pasted below). Is this what I should expect? Thanks again for all the help and prompt responses.
==== LOG ====
2019-10-09 16:55:43,872 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
2019-10-09 16:55:43,872 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 10 --json-conf configuration.yaml --subloci-out mikado.subloci.gff3
2019-10-09 16:55:43,873 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-09 16:55:44,034 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-09 16:55:44,753 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold01
2019-10-09 16:55:54,342 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold01
2019-10-09 16:55:54,343 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold02
2019-10-09 16:56:01,011 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold02
2019-10-09 16:56:01,012 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold03
2019-10-09 16:56:07,760 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold03
2019-10-09 16:56:07,760 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold04
2019-10-09 16:56:12,933 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold04
2019-10-09 16:56:12,935 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold05
2019-10-09 16:56:18,335 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold05
2019-10-09 16:56:18,949 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold06
2019-10-09 16:56:23,884 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold06
2019-10-09 16:56:24,416 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold07
2019-10-09 16:56:29,808 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold07
2019-10-09 16:56:29,808 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold08
2019-10-09 16:56:34,535 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold08
2019-10-09 16:56:34,536 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold09
2019-10-09 16:56:39,845 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold09
2019-10-09 16:56:39,845 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold10
2019-10-09 16:56:44,978 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold10
2019-10-09 16:56:44,978 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold11
2019-10-09 16:56:49,721 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold11
2019-10-09 16:56:49,722 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold12
2019-10-09 16:56:52,923 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold12
2019-10-09 16:56:52,924 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold13
2019-10-09 16:56:56,259 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold13
2019-10-09 16:56:56,259 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold14
2019-10-09 16:57:00,088 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold14
2019-10-09 16:57:00,088 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold15
2019-10-09 16:57:03,312 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold15
2019-10-09 16:57:03,312 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold16
2019-10-09 16:57:06,265 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold16
2019-10-09 16:57:06,265 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold17
2019-10-09 16:57:08,525 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold17
2019-10-09 16:57:08,525 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold18
2019-10-09 16:57:11,029 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold18
2019-10-09 16:57:11,030 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold19
2019-10-09 16:57:12,528 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold19
2019-10-09 16:57:12,528 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold20
2019-10-09 16:57:14,119 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold20
2019-10-09 16:57:14,119 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold21
2019-10-09 16:57:16,539 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold21
2019-10-09 16:57:16,539 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold22
2019-10-09 16:57:18,169 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold22
2019-10-09 16:57:18,170 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold23
2019-10-09 16:57:19,370 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold23
2019-10-09 16:57:19,370 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold24
2019-10-09 16:57:20,446 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold24
2019-10-09 16:57:20,446 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold25
2019-10-09 16:57:21,165 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold25
2019-10-09 16:57:21,165 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold26
2019-10-09 16:57:21,777 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold26
2019-10-09 16:57:21,777 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold27
2019-10-09 16:57:21,954 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold27
2019-10-09 16:57:21,954 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold28
2019-10-09 16:57:22,132 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold28
2019-10-09 16:57:22,132 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold29
2019-10-09 16:57:22,176 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold29
2019-10-09 16:57:22,181 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold30
2019-10-09 16:57:25,363 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold30
2019-10-09 16:57:26,590 - Scaffold01:21107-135422 - locus.py:1286 - INFO - expand_transcript - LociProcesser-5 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 has now start 93847, end 132507.
2019-10-09 16:57:26,593 - Scaffold01:21107-135422 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-5 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 is now padded
2019-10-09 16:57:26,729 - Scaffold02:38065332-38130546 - locus.py:1286 - INFO - expand_transcript - LociProcesser-4 - gmap3_TRINITY_DN207755_c0_g2_i5.path1 has now start 38074136, end 38094165.
2019-10-09 16:57:26,735 - Scaffold02:38065332-38130546 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-4 - gmap3_TRINITY_DN207755_c0_g2_i5.path1 is now padded
2019-10-09 16:57:29,015 - Scaffold04:29557752-29800006 - locus.py:1286 - INFO - expand_transcript - LociProcesser-5 - gmap3_TRINITY_DN192120_c7_g1_i1.path1 has now start 29664764, end 29668632.
2019-10-09 16:57:29,018 - Scaffold04:29557752-29800006 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-5 - gmap3_TRINITY_DN192120_c7_g1_i1.path1 is now padded
2019-10-09 16:57:29,155 - Scaffold04:29557752-29800006 - locus.py:1286 - INFO - expand_transcript - LociProcesser-5 - gmap2_ofu_40702.5.path1 has now start 29560143, end 29592540.
2019-10-09 16:57:29,157 - Scaffold04:29557752-29800006 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-5 - gmap2_ofu_40702.5.path1 is now padded
2019-10-09 16:57:30,012 - Scaffold03:36653287-36861509 - locus.py:1286 - INFO - expand_transcript - LociProcesser-9 - gmap1_TRINITY_DN116343_c1_g1_i1.path2 has now start 36826171, end 36853449.
2019-10-09 16:57:30,023 - Scaffold03:36653287-36861509 - locus.py:1286 - INFO - expand_transcript - LociProcesser-9 - gmap3_TRINITY_DN202163_c0_g1_i3.path1 has now start 36826171, end 36853449.
2019-10-09 16:57:30,029 - Scaffold03:36653287-36861509 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-9 - gmap3_TRINITY_DN202163_c0_g1_i3.path1 is now padded
2019-10-09 16:57:30,030 - Scaffold03:36653287-36861509 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-9 - gmap1_TRINITY_DN116343_c1_g1_i1.path2 is now padded
Dear @ChemaMD, no, it is not normal. Mikado should have been dumping transcripts to the temp_store.db file pretty continuously. The only reason it might be stalling is if it has found an extremely dense locus (on the order of tens of thousands of transcripts or more), which would slow down the transmission of data to the child processes.
If that is the case, it should get unstuck eventually.
If it does not, then I think I would really need to have a look at the data myself. As I mentioned earlier, I would treat it with absolute confidentiality, and I would be ready to sign an NDA after consulting with the relevant department at my institution (or we could, potentially, set up a collaboration - whatever would work best).
Kind regards
Dear @lucventurini , I started two approaches to overcome this issue and test whether the problem is a complex locus stalling mikado pick. I launched two different jobs: one with the entire dataset and 27 allocated cores; for the other, I split the original dataset by scaffold (30 jobs) and launched mikado pick with 3 cores for each subset, which appeared to work well in one of yesterday's tests. The second approach is running, and managed to process the entire Scaffold 1 quite quickly, but gave an error at the moment of merging:
===== LOG ====
2019-10-10 10:41:58,907 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared_Scaffold1.gtf
2019-10-10 10:41:58,907 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 3 --json-conf configuration.yaml --output-dir Scaffold1 mikado_prepared_Scaffold1.gtf
2019-10-10 10:41:58,908 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-10 10:41:59,044 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-10 10:41:59,442 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold01
2019-10-10 10:42:09,653 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold01
2019-10-10 10:42:10,956 - Scaffold01:21107-135422 - locus.py:1286 - INFO - expand_transcript - LociProcesser-3 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 has now start 93847, end 132507.
2019-10-10 10:42:10,959 - Scaffold01:21107-135422 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-3 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 is now padded
2019-10-10 10:45:00,606 - listener - picker.py:743 - INFO - __submit_multi_threading - MainProcess - Joined children processes; starting to merge partial files
2019-10-10 10:45:00,613 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-10-10 10:45:00,613 - main - __init__.py:121 - ERROR - main - MainProcess - {10}
Traceback (most recent call last):
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
args.func(args)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/subprograms/pick.py", line 199, in pick
creator()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1078, in __call__
self._parse_and_submit_input()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1044, in _parse_and_submit_input
self.__submit_multi_threading()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/picker.py", line 750, in __submit_multi_threading
tempdir=tempdir)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/loci_processer.py", line 79, in merge_loci
set.difference(set(range(1, max(common_index.keys()) + 1)), set(common_index.keys()))
AssertionError: {10}
Dear @ChemaMD, thank you, this looks like a proper bug. I will have a look at the relevant code now.
Regarding finding the locus that is creating the problem - I have thought of a way.
1- Convert the prepared GTF into a BED12, putting "1" as a score:
mikado util convert -of bed12 mikado_prepared.gtf | awk -F "\t" '{OFS="\t"; $5=1; print $0}' > mikado_prepared.bed12
2- Use BEDtools to merge the intervals and find the culprit:
bedtools merge -c 5 -d 200 -i mikado_prepared.bed12 -o sum > merged.bed
Please change -d 200 to the value assigned to pick/clustering/flank inside the Mikado configuration (default should be 200).
This should already give us an indication of which loci are massive: it will produce a BED file where the fourth column is the number of transcripts per locus. This should lead us fairly quickly to the culprit.
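To surface the worst offenders quickly, the merged file can be sorted by its transcript-count column. A small sketch (the toy input below merely stands in for the real merged.bed produced by the bedtools step above):

```shell
# Toy stand-in for the merged.bed produced by the bedtools step above;
# columns are chrom, start, end, transcripts-per-locus.
printf 'Scaffold01\t12154\t15286\t18\nScaffold01\t21106\t135422\t456\nScaffold01\t147948\t43327084\t354690\n' > merged.bed

# Sort by transcript count (column 4), largest first: the top lines are
# the super-loci most likely to be stalling mikado pick.
sort -k4,4nr merged.bed | head -n 3
```

On real data, the first few lines printed are the candidate loci to inspect (here, the 354,690-transcript interval comes out on top).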
Another possible solution, and one that could make a big impact, is to change the parameter prepare/max_intron_length. By default, mikado prepare is quite lenient and will keep anything with an intron of 1 Mbp or less. However, a single transcript with an intron of that size might end up chaining together tens of thousands of transcripts, hence blocking the program. A much lower value, such as 50000, might solve your problem by throwing out transcripts that are:
a- implausible
b- creating loci too big for mikado to swallow.
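For reference, the relevant key lives under prepare in the Mikado configuration file. A minimal sketch (the exact layout may vary between Mikado versions; 50,000 is just an example of a stricter cutoff, not a recommendation):

```yaml
prepare:
  # Default is 1,000,000 bp; transcripts with a longer intron are
  # dropped at the prepare stage, before they can chain loci together.
  max_intron_length: 50000
```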
I hope this helps.
PS: @swarbred, would you think that reducing the default value for the maximum intron length (1,000,000) would be a good idea? It seems that keeping it this high is creating problems. I know that there are known introns in the H. sapiens annotation as long as 4 Mbp, but even there, 99% of known introns are shorter than 100,000 bp (more precisely, under 87,000 bp). Maybe putting the default limit at 100,000 would be a saner option?
@lucventurini
I would be wary of changing this to 100,000: while there are relatively few transcripts with an intron larger than this (~7,000 in the human gencode.v29 annotation), excluding them in the prepare step would remove any chance of annotating them.
The HISAT2 default is 500,000, and there are only 105 introns above this, so that is probably a better choice if you were to make a change.
We have run lots of rodent genomes through with no issues (with the default 1,000,000, but based on HISAT2 alignments with a maximum intron size of 500,000).
@swarbred
Good point. Thank you for the comment.
I think that in general 500,000 might be a saner option, thinking back to some pathological cases. But I will not pull the trigger yet.
@lucventurini ok, I started a new mikado run with the lower intron length (50,000; if mikado now runs completely, I might then consider going up to 500,000 or just 100,000). In my case, this is an annelid genome, and there are not many other genomes to use as reference (the best one would be Capitella, and it is half the size of the genome I'm annotating). Of the ~4.8M transcripts I loaded in mikado prepare, ~27,000 were discarded with the default cutoff and ~80,000 with the 50,000 cutoff. That's 1.7% of the loaded transcripts (probably even less in terms of genes), quite a small proportion. In fact, I checked, and all of them come from Trinity genes mapped to the genome (with GMAP), so these rare cases might just be Trinity misassemblies? In my dataset, ~2/3 of the loaded transcripts (~3.3M) come from mapped RNA-seq reads, so hopefully the genes will still be annotated.
The test running the initial dataset scaffold by scaffold is still ongoing; it has only managed to finish the three smallest scaffolds. In some of them, there was an error, I think associated with an invalid codon, which didn't kill the run but meant removing the super-locus (log below).
2019-10-10 15:06:56,633 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared_Scaffold27.gtf
2019-10-10 15:06:56,633 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 3 --json-conf configuration.yaml --output-dir Scaffold27 mikado_prepared_Scaffold27.gtf
2019-10-10 15:06:56,634 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-10 15:06:56,961 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-10 15:06:57,641 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold27
2019-10-10 15:06:58,102 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold27
2019-10-10 15:20:36,049 - Scaffold27:32517-1419232 - locus.py:822 - ERROR - pad_transcripts - LociProcesser-3 - Codon 'GGg' is invalid
Traceback (most recent call last):
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 145, in _translate_str
residue = forward_table[codon]
KeyError: 'GGg'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 818, in pad_transcripts
self.logger)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1244, in expand_transcript
transcript = enlarge_orfs(transcript, backup, seq, upstream, downstream, logger)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1545, in enlarge_orfs
orf.expand(seq, upstream, downstream, expand_orf=True, logger=logger)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 1147, in expand
prot_seq = _translate_str(coding_seq, table=self.table, gap="N")
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 153, in _translate_str
raise CodonTable.TranslationError("Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon 'GGg' is invalid
2019-10-10 15:20:36,088 - Scaffold27:32517-1419232 - loci_processer.py:318 - ERROR - analyse_locus - LociProcesser-3 - Codon 'GGg' is invalid
Traceback (most recent call last):
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 145, in _translate_str
residue = forward_table[codon]
KeyError: 'GGg'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/loci_processer.py", line 312, in analyse_locus
stranded_locus.define_loci()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 1195, in define_loci
self.define_alternative_splicing()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 1334, in define_alternative_splicing
self.loci[lid].finalize_alternative_splicing()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 185, in finalize_alternative_splicing
failed = self.__launch_padding()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 212, in __launch_padding
templates = self.pad_transcripts()
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 818, in pad_transcripts
self.logger)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1244, in expand_transcript
transcript = enlarge_orfs(transcript, backup, seq, upstream, downstream, logger)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1545, in enlarge_orfs
orf.expand(seq, upstream, downstream, expand_orf=True, logger=logger)
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 1147, in expand
prot_seq = _translate_str(coding_seq, table=self.table, gap="N")
File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 153, in _translate_str
raise CodonTable.TranslationError("Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon 'GGg' is invalid
2019-10-10 15:20:36,134 - Scaffold27:32517-1419232 - loci_processer.py:319 - ERROR - analyse_locus - LociProcesser-3 - Removing failed locus superlocus:Scaffold27-:55769-419600
2019-10-10 15:20:37,965 - Scaffold27:32517-1419232 - locus.py:1286 - INFO - expand_transcript - LociProcesser-3 - gmap1_TRINITY_DN127226_c2_g1_i3.path1 has now start 824201, end 834738.
2019-10-10 15:20:37,967 - Scaffold27:32517-1419232 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-3 - gmap1_TRINITY_DN127226_c2_g1_i3.path1 is now padded
2019-10-10 15:20:42,112 - listener - picker.py:743 - INFO - __submit_multi_threading - MainProcess - Joined children processes; starting to merge partial files
2019-10-10 15:20:42,115 - listener - loci_processer.py:89 - INFO - merge_loci - MainProcess - We have a total of 178 genes
2019-10-10 15:20:42,654 - listener - picker.py:752 - INFO - __submit_multi_threading - MainProcess - Finished merging partial files
2019-10-10 15:20:42,659 - main_logger - picker.py:1087 - INFO - __call__ - MainProcess - Finished analysis of mikado_prepared_Scaffold27.gtf
@ChemaMD
Of the ~4.8M transcripts I loaded in mikado prepare, ~27,000 were discarded with the default cutoff and ~80,000 with the 50,000 cutoff. That's 1.7% of the loaded transcripts (probably even less in terms of genes), quite a small proportion. In fact, I checked, and all of them come from Trinity genes mapped to the genome (with GMAP), so these rare cases might just be Trinity misassemblies? In my dataset, ~2/3 of the loaded transcripts (~3.3M) come from mapped RNA-seq reads, so hopefully the genes will still be annotated.
Indeed they might just be Trinity misassemblies, or GMAP misalignments. This would fit the pattern we observed in other datasets.
The test running the initial dataset scaffold by scaffold is still ongoing; it has only managed to finish the three smallest scaffolds. In some of them, there was an error, I think associated with an invalid codon, which didn't kill the run but meant removing the super-locus (log below).
Yes indeed. This is an oversight on my part, I will fix it immediately. Thank you again for reporting!
EDIT: fixed in a278687f
@ChemaMD Just to comment: for an annelid genome, a 50,000 max intron size should be fine; the ~80,000 transcripts removed will overwhelmingly be alignment artefacts.
Thanks @swarbred and @lucventurini for the feedback and help. The single-scaffold runs are progressively finishing, and based on the size of these scaffolds, it appears that mikado pick is roughly processing ~1.5 Mb of genomic sequence and ~150 genes per hour with 3 CPUs. Is that expected?
@ChemaMD, on our A. thaliana test dataset, mikado did approximately 45 Mbp/hour per CPU. However, that dataset had about 20 times fewer transcripts, at about 200,000.
Your observation is therefore in line with what I would expect: mikado should scale almost linearly with transcript number and transcript density, and that seems to be the case here. Throwing more CPUs at the problem should help, as the task is completely parallelised.
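For concreteness, that linear-scaling expectation can be checked against the numbers quoted in this thread (a back-of-the-envelope sketch; linearity in transcript count is an assumption, not a guarantee):

```shell
# Reference run: ~45 Mbp/hour per CPU on ~200,000 transcripts.
# This dataset: ~4,800,000 transcripts, i.e. ~24x more.
# Under linear scaling, the expected per-CPU throughput is:
awk 'BEGIN { printf "expected: ~%.1f Mbp/hour per CPU\n", 45 / (4800000 / 200000) }'
```

This predicts roughly 1.9 Mbp/hour per CPU, the same order of magnitude as the observed ~0.5 Mbp/hour per CPU (1.5 Mbp/hour on 3 CPUs); transcript density and pathological super-loci plausibly account for the remaining gap.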
Thanks @lucventurini ! I will continue with the run with the limited intron size and submit mikado pick with a good number of CPUs.
Not a problem! Out of curiosity, how big is the genome you are analysing? I had a quick look, and the best I could find seemed to indicate a genome size between 250 Mbp and 1.2 Gbp for Annelida. No particular reason, just curious about the organism (and, admittedly, about whether my back-of-the-envelope calculations on mikado performance are really in the right ballpark).
The genome is ~550 Mb, mid-range for an annelid. We will have some others closer to the 1 Gb upper range. I did the BED12 calculation for Scaffold01 (~44 Mb), and there seems to be a super-locus that spans almost the entire chromosome, and of course it contains hundreds of thousands of transcripts:
Scaffold01 11140 11472 1
Scaffold01 12154 15286 18
Scaffold01 16834 17458 6
Scaffold01 20105 20419 1
Scaffold01 21106 135422 456
Scaffold01 136130 136355 1
Scaffold01 137160 137594 1
Scaffold01 138823 142241 22
Scaffold01 142542 142831 3
Scaffold01 145883 146234 1
Scaffold01 147948 43327084 354690
Scaffold01 43329785 43330031 1
Scaffold01 43330355 43330718 1
Scaffold01 43333150 43334005 3
Scaffold01 43335281 43336324 18
Scaffold01 43336656 43337336 9
Scaffold01 43339720 43340005 1
Scaffold01 43340848 43341056 1
Yep, that would nicely break Mikado's parallelisation: all those hundreds of thousands of transcripts would have been analysed by a single process. I hope that the maximum-intron heuristic will solve the issue. Please let us know when the run finishes, or whether you encounter any further issues.
Dear @lucventurini , mikado pick finished without any issues on the run with the maximum intron length set to 50 kb! Thank you very much for all your help!
Excellent news! Thank you for letting us know.
I will close the issue, and we can discuss further on Wednesday.
Hi,
Thank you very much for developing such a useful piece of software. I'm running the latest version of Mikado (2) and I'm encountering a parsing problem with an XML diamond BLAST file. I'm running serialise on a cluster employing a single core (I have a single XML file, and from the help info of mikado serialise, it seems I cannot use more CPUs than XML files) and 15 GB of RAM. It is apparently crashing when parsing queries that hit the same target "Q9NBX4" (there are 8,001 hits in the XML; it appears to be a transposon). I have attached the logs. Any idea how to overcome this issue?
Thank you very much for your help.
Chema
====== SERIALISE.LOG ======
====== STDOUT ======