EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Mikado Serialise crashing in BLAST parsing #229

Closed ChemaMD closed 4 years ago

ChemaMD commented 4 years ago

Hi,

Thank you very much for developing such a useful piece of software. I'm running the latest version of Mikado (2) and I'm encountering a parsing problem with an XML diamond BLAST file. I'm running serialise on a cluster using a single core (I have a single XML file, and from the help info of mikado serialise it seems I cannot use more CPUs than XML files) and 15GB of RAM. It apparently crashes when parsing queries that hit the same target, "Q9NBX4" (there are 8,001 hits to it in the XML; it appears to be a transposon). I have attached the logs. Any idea how to overcome this issue?

Thank you very much for your help.

Chema

====== SERIALISE.LOG ======

2019-10-07 11:27:07,476 - serialiser - serialise.py:288 - INFO - setup - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2/bin/mikado serialise --procs 1 --json-conf configuration.yaml --xml mikado.diamond.xml --orfs mikado_prepared.fasta.transdecoder.bed --blast_targets /data/SBCS-MartinDuranLab/00-BlastDBs/Uniprot_SwissProt_160719.fasta --transcripts mikado_prepared.fasta --junctions /data/scratch/btx333/10-Mikado/mikado_input/portcullis_filtered.pass.junctions.bed
2019-10-07 11:27:07,476 - serialiser - serialise.py:294 - INFO - setup - MainProcess - Random seed: 2834569384
2019-10-07 11:27:07,476 - serialiser - serialise.py:332 - INFO - setup - MainProcess - Using a sqlite database (location: /data/scratch/btx333/10-Mikado/mikado.db)
2019-10-07 11:27:07,476 - serialiser - serialise.py:336 - INFO - setup - MainProcess - Requested 1 threads, forcing single thread: False
2019-10-07 11:27:07,476 - serialiser - serialise.py:140 - INFO - load_orfs - MainProcess - Starting to load ORF data
2019-10-07 11:29:25,737 - serialiser - orf.py:347 - INFO - __serialize_single_thread - MainProcess - Finished loading 706077 ORFs into the database
2019-10-07 11:29:30,683 - serialiser - serialise.py:152 - INFO - load_orfs - MainProcess - Finished loading ORF data
2019-10-07 11:29:31,208 - serialiser - serialise.py:106 - INFO - load_blast - MainProcess - Starting to load BLAST data
2019-10-07 11:30:49,294 - serialiser - xml_serialiser.py:327 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2019-10-07 11:31:36,580 - serialiser - xml_serialiser.py:362 - INFO - __serialize_targets - MainProcess - Loaded 560459 objects into the "target" table
2019-10-07 11:31:38,439 - serialiser - xml_serialiser.py:256 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2019-10-07 11:31:40,126 - serialiser - xml_serialiser.py:304 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2019-10-07 11:31:52,790 - serialiser - xml_serialiser.py:311 - INFO - __serialize_queries - MainProcess - 3831526 in queries

====== STDOUT ======

2019-10-07 11:31:53,183 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-10-07 11:31:53,186 - main - __init__.py:121 - ERROR - main - MainProcess - 'Q9NBX4'
Traceback (most recent call last):
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
    args.func(args)
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 378, in serialise
    load_blast(args, logger)
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 124, in load_blast
    part_launcher(filenames)
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 54, in xml_launcher
    xml_serializer()
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 536, in __call__
    self.serialize()
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 479, in serialize
    cache=cache, max_target_seqs=self.__max_target_seqs, logger=self.logger)
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 683, in objectify_record
    current_target, cache["target"] = _get_target_for_blast(session, alignment, cache["target"])
  File "/data/home/btx333/.conda/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 643, in _get_target_for_blast
    return cache[alignment.accession], cache
KeyError: 'Q9NBX4'
lucventurini commented 4 years ago

Dear @ChemaMD, many thanks for reporting this bug. I will have a look at it now.

Kind regards

lucventurini commented 4 years ago

Dear @ChemaMD , I should have fixed it in the latest push (68fa321). It was a very silly, easy-to-solve bug, so many thanks for reporting it!

Please let me know whether the latest pull will indeed fix it for you.

Kind regards

ChemaMD commented 4 years ago

Thank you very much for such a quick response and solution! I will update and re-run.

ChemaMD commented 4 years ago

Dear @lucventurini , the update solved the issue! Thank you very much for your help.

I'm now at the Mikado pick step, and it seems to run for a long time without writing anything to any of the output files. The log gives some warnings, and it looks as if the pipeline starts checking the database but then goes silent. I tried running the same command with --procs 5, in all cases allocating ~30-40 GB of memory on the cluster, and the log is similar. Is this normal?

Thank you for your help.

====== Pick log =====

2019-10-08 15:50:59,944 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
2019-10-08 15:50:59,946 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 1 --json-conf configuration.yaml --subloci-out mikado.subloci.gff3
2019-10-08 15:50:59,947 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-08 15:51:00,131 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-08 15:51:02,723 - Scaffold01:21107-135422 - locus.py:1286 - INFO - expand_transcript - MainProcess - gmap1_TRINITY_DN103619_c1_g2_i21.path1 has now start 93847, end 132507.
2019-10-08 15:51:02,727 - Scaffold01:21107-135422 - locus.py:195 - WARNING - finalize_alternative_splicing - MainProcess - gmap1_TRINITY_DN103619_c1_g2_i21.path1 is now padded
lucventurini commented 4 years ago

Dear @ChemaMD, mikado pick should be quite fast, so if it is not writing anything there is something that warrants investigation. Would you please be able to send me the log file? Attaching it here would be fine, if it is small enough.

Best

ChemaMD commented 4 years ago

I just launched the following command:

mikado pick --json-conf configuration.yaml --subloci-out mikado.subloci.gff3 --shm --no-pad --log pick.log

I added the --no-pad flag as in previous attempts there was a WARNING regarding the padding of one transcript (see previous message).

Below is the log this run has produced so far (it is still running but not writing anything). I consistently get a WARNING message with setup_logger - MainProcess - Current level for queue: INFO. The genome has 30 scaffolds, and after finishing Scaffold30 it keeps running without outputting anything.

======= PICK LOG ======

2019-10-08 16:57:36,411 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
2019-10-08 16:57:36,411 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --json-conf configuration.yaml --subloci-out mikado.subloci.gff3 --shm --no-pad --log pick.log
2019-10-08 16:57:36,411 - main_logger - picker.py:274 - INFO - setup_shm_db - MainProcess - Copying Mikado database into a SHM db
2019-10-08 16:57:37,781 - main_logger - picker.py:293 - INFO - setup_shm_db - MainProcess - DB copied into memory
2019-10-08 16:57:37,781 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-08 16:57:37,909 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-08 16:57:38,162 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold01
2019-10-08 16:57:47,529 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold01
2019-10-08 16:57:47,531 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold02
2019-10-08 16:57:54,099 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold02
2019-10-08 16:57:54,102 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold03
2019-10-08 16:58:00,794 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold03
2019-10-08 16:58:00,795 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold04
2019-10-08 16:58:05,983 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold04
2019-10-08 16:58:05,985 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold05
2019-10-08 16:58:11,523 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold05
2019-10-08 16:58:12,024 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold06
2019-10-08 16:58:17,151 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold06
2019-10-08 16:58:17,616 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold07
2019-10-08 16:58:23,040 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold07
2019-10-08 16:58:23,040 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold08
2019-10-08 16:58:27,665 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold08
2019-10-08 16:58:27,665 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold09
2019-10-08 16:58:32,882 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold09
2019-10-08 16:58:32,884 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold10
2019-10-08 16:58:38,165 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold10
2019-10-08 16:58:38,165 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold11
2019-10-08 16:58:42,873 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold11
2019-10-08 16:58:42,874 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold12
2019-10-08 16:58:46,108 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold12
2019-10-08 16:58:46,111 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold13
2019-10-08 16:58:49,454 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold13
2019-10-08 16:58:49,454 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold14
2019-10-08 16:58:53,142 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold14
2019-10-08 16:58:53,142 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold15
2019-10-08 16:58:56,390 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold15
2019-10-08 16:58:56,390 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold16
2019-10-08 16:58:59,380 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold16
2019-10-08 16:58:59,382 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold17
2019-10-08 16:59:01,583 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold17
2019-10-08 16:59:01,584 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold18
2019-10-08 16:59:04,192 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold18
2019-10-08 16:59:04,192 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold19
2019-10-08 16:59:05,687 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold19
2019-10-08 16:59:05,687 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold20
2019-10-08 16:59:07,255 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold20
2019-10-08 16:59:07,255 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold21
2019-10-08 16:59:09,710 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold21
2019-10-08 16:59:09,710 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold22
2019-10-08 16:59:11,373 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold22
2019-10-08 16:59:11,373 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold23
2019-10-08 16:59:12,568 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold23
2019-10-08 16:59:12,568 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold24
2019-10-08 16:59:13,574 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold24
2019-10-08 16:59:13,575 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold25
2019-10-08 16:59:14,387 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold25
2019-10-08 16:59:14,387 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold26
2019-10-08 16:59:14,982 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold26
2019-10-08 16:59:14,982 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold27
2019-10-08 16:59:15,156 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold27
2019-10-08 16:59:15,156 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold28
2019-10-08 16:59:15,379 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold28
2019-10-08 16:59:15,379 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold29
2019-10-08 16:59:15,423 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold29
2019-10-08 16:59:15,427 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold30
2019-10-08 16:59:16,823 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold30
lucventurini commented 4 years ago

Dear @ChemaMD ,

I added the --no-pad flag as in previous attempts there was a WARNING regarding the padding of one transcript (see previous message).

My fault, the message should really be at INFO level. It is simply logging which transcripts have been padded, nothing more.

Below is the log this run has produced so far (it is still running but not writing anything). I consistently get a WARNING message with setup_logger - MainProcess - Current level for queue: INFO. The genome has 30 scaffolds, and after finishing Scaffold30 it keeps running without outputting anything.

In the output folder there should be a temporary folder (mikado_pick_tmp*), with temporary files being written into it. If that is the case, things are proceeding as they should. The parsing message relates to the parsing of the input file, not to the actual analysis. I think I will change the behaviour so that it reports when the analysis is actually finished for a given chromosome.
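For reference, a simple way to keep an eye on this during a run would be something along these lines (directory name pattern as above; exact file names may differ):

ls -d mikado_pick_tmp*        # the temporary folder should appear shortly after the start
du -sh mikado_pick_tmp*/      # its total size should keep growing while pick is working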

ChemaMD commented 4 years ago

Dear @lucventurini ,

The command did not generate the temporary folder, and it is still running (it's been running for 5h) without writing anything to the mikado.loci.* and mikado.subloci.* files, which have their headers but are otherwise empty. The temporary folder only appears when I run mikado pick with more than two processors. Inside that folder there are several files, also empty. To me it looks as if mikado pick prepares everything but then never starts picking transcripts. I think the configuration.yaml file is correct (attached) and points to all the right files.

Below are the commands I ran:

mikado configure -t 6 --list Owenia_mikadoInput_list --reference Owenia_softmasked_v072019.fa --mode permissive --scoring human.yaml --copy-scoring vertebrates.yaml --junctions ./mikado_input/portcullis_filtered.pass.junctions.bed -bt /data/SBCS-MartinDuranLab/00-BlastDBs/Uniprot_SwissProt_160719 configuration.yaml
mikado prepare --json-conf configuration.yaml
mikado serialise --procs 1 --json-conf configuration.yaml --xml mikado.diamond.xml --orfs mikado_prepared.fasta.transdecoder.bed --blast_targets /data/SBCS-MartinDuranLab/00-BlastDBs/Uniprot_SwissProt_160719.fasta --transcripts mikado_prepared.fasta --junctions /data/scratch/btx333/10-Mikado/mikado_input/portcullis_filtered.pass.junctions.bed
mikado pick --json-conf configuration.yaml --subloci-out mikado.subloci.gff3 --shm --no-pad --log pick.log

configuration.txt

lucventurini commented 4 years ago

Dear @ChemaMD, it is really strange. Mikado pick should be writing to the files inside the temporary directory almost immediately. Would you please be able to do the following:

This will force Mikado into a single process and should finish really quickly. The log will have a lot of information we can use to understand what is going on - especially if Mikado still does not finish even with a reduced GTF.
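As a rough sketch of this kind of reduced-input, single-process test (the 10,000-line cut-off and the flags below are only an example, adapt as needed):

head -n 10000 mikado_prepared.gtf > mikado_test.gtf
mikado pick --procs 1 -lv DEBUG --json-conf configuration.yaml mikado_test.gtf > pick_test.log 2>&1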

ChemaMD commented 4 years ago

Hi @lucventurini , it now worked well with a GTF of just 10,000 lines; I have attached the log (pick_test.log). The original GTF file is 9,212,560 lines. pick_test.log

lucventurini commented 4 years ago

Dear @ChemaMD , OK this is progress! May I ask you then to:

If it works, we should then try with a whole scaffold/chromosome (the smallest is probably the safest bet), again starting with -lv DEBUG.

Also, I have modified the code of Mikado (see 36567b0 for details, but please download 9da88d4) so that the log now reports when it has actually finished analysing a locus when in multiprocessing mode. In single-process mode, the messages were already informative (i.e. when it says it has finished a chromosome, it has finished analysing it).

When launching with multiple processes, inside the temporary folder, there should be a temp_store.db file (where the transcripts to be processed are stored) and multiple output_*.db files (where each child process writes out the results before the final merging). These are all SQLite3 databases. All these files should be populated fairly quickly during the run, so if they are not, something strange is going on.
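As a quick sanity check during a run, something along these lines should tell you whether they are being populated (file names as described above; the exact output name pattern may differ):

ls -lh mikado_pick_tmp*/                                            # file sizes should keep growing
sqlite3 "$(ls mikado_pick_tmp*/output*.db | head -n 1)" ".tables"   # list the tables in one partial database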

I am honestly still unclear on what is going on, but hopefully through this debugging we can solve the issue.

ChemaMD commented 4 years ago

Hi @lucventurini , I just tried with multiple processes and the small test GTF, and it worked well. It created the temporary folder. It also worked well with just the smallest chromosome, and also when adding the --subloci-out flag (BTW, I think in previous versions and in the documentation the flag was --subloci_out, with an underscore). I will now launch it with the entire dataset and see what happens.

mikado pick --procs 3 --json-conf configuration.yaml mikado_test.gtf > pick_test2.log 2>&1
mikado pick --procs 3 --json-conf configuration.yaml mikado_prepared.Scaffold30.gtf > pick_test3.log 2>&1
lucventurini commented 4 years ago

Hi @lucventurini , I just tried with multiple processes and the small test GTF, and it worked well. It created the temporary folder. It also worked well with just the smallest chromosome.

Excellent. Hopefully the new logging will give you a sense of how it is proceeding with the full dataset. Thank you for nudging me into making those messages more informative!

BTW, I think in previous versions and in the documentation, the flag was --subloci_out with underscore

Yes, well spotted - I changed it recently. The documentation is being updated on a dedicated branch; once this issue and #226 are fixed, it is one of the few remaining roadblocks to finally having an official Mikado 2.

ChemaMD commented 4 years ago

Dear @lucventurini , I launched mikado pick with the entire dataset in DEBUG mode and everything seemed to go as planned (I stopped it after five hours, since I estimated it would take around 7 days in DEBUG mode to parse the entire mikado_prepared.gtf file). I then relaunched the job without the -lv DEBUG flag, this time with 10 CPUs and 1 GB of RAM allocated per CPU. The run was launched at 16:55; it generated the mikado_pick_tmp* folder and created three tmp_store.db* files and ten output-[1-10].db files. mikado pick wrote into these files at 16:57, barely after the job started, but nothing else after that (although the job is apparently still running on the cluster). The log stopped growing at exactly that time (pasted below). Is this what I should expect? Thanks again for all the help and prompt responses.

==== LOG ====

2019-10-09 16:55:43,872 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
2019-10-09 16:55:43,872 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 10 --json-conf configuration.yaml --subloci-out mikado.subloci.gff3
2019-10-09 16:55:43,873 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-09 16:55:44,034 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-09 16:55:44,753 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold01
2019-10-09 16:55:54,342 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold01
2019-10-09 16:55:54,343 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold02
2019-10-09 16:56:01,011 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold02
2019-10-09 16:56:01,012 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold03
2019-10-09 16:56:07,760 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold03
2019-10-09 16:56:07,760 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold04
2019-10-09 16:56:12,933 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold04
2019-10-09 16:56:12,935 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold05
2019-10-09 16:56:18,335 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold05
2019-10-09 16:56:18,949 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold06
2019-10-09 16:56:23,884 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold06
2019-10-09 16:56:24,416 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold07
2019-10-09 16:56:29,808 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold07
2019-10-09 16:56:29,808 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold08
2019-10-09 16:56:34,535 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold08
2019-10-09 16:56:34,536 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold09
2019-10-09 16:56:39,845 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold09
2019-10-09 16:56:39,845 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold10
2019-10-09 16:56:44,978 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold10
2019-10-09 16:56:44,978 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold11
2019-10-09 16:56:49,721 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold11
2019-10-09 16:56:49,722 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold12
2019-10-09 16:56:52,923 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold12
2019-10-09 16:56:52,924 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold13
2019-10-09 16:56:56,259 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold13
2019-10-09 16:56:56,259 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold14
2019-10-09 16:57:00,088 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold14
2019-10-09 16:57:00,088 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold15
2019-10-09 16:57:03,312 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold15
2019-10-09 16:57:03,312 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold16
2019-10-09 16:57:06,265 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold16
2019-10-09 16:57:06,265 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold17
2019-10-09 16:57:08,525 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold17
2019-10-09 16:57:08,525 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold18
2019-10-09 16:57:11,029 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold18
2019-10-09 16:57:11,030 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold19
2019-10-09 16:57:12,528 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold19
2019-10-09 16:57:12,528 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold20
2019-10-09 16:57:14,119 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold20
2019-10-09 16:57:14,119 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold21
2019-10-09 16:57:16,539 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold21
2019-10-09 16:57:16,539 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold22
2019-10-09 16:57:18,169 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold22
2019-10-09 16:57:18,170 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold23
2019-10-09 16:57:19,370 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold23
2019-10-09 16:57:19,370 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold24
2019-10-09 16:57:20,446 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold24
2019-10-09 16:57:20,446 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold25
2019-10-09 16:57:21,165 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold25
2019-10-09 16:57:21,165 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold26
2019-10-09 16:57:21,777 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold26
2019-10-09 16:57:21,777 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold27
2019-10-09 16:57:21,954 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold27
2019-10-09 16:57:21,954 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold28
2019-10-09 16:57:22,132 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold28
2019-10-09 16:57:22,132 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold29
2019-10-09 16:57:22,176 - listener - picker.py:864 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold29
2019-10-09 16:57:22,181 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold30
2019-10-09 16:57:25,363 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold30
2019-10-09 16:57:26,590 - Scaffold01:21107-135422 - locus.py:1286 - INFO - expand_transcript - LociProcesser-5 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 has now start 93847, end 132507.
2019-10-09 16:57:26,593 - Scaffold01:21107-135422 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-5 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 is now padded
2019-10-09 16:57:26,729 - Scaffold02:38065332-38130546 - locus.py:1286 - INFO - expand_transcript - LociProcesser-4 - gmap3_TRINITY_DN207755_c0_g2_i5.path1 has now start 38074136, end 38094165.
2019-10-09 16:57:26,735 - Scaffold02:38065332-38130546 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-4 - gmap3_TRINITY_DN207755_c0_g2_i5.path1 is now padded
2019-10-09 16:57:29,015 - Scaffold04:29557752-29800006 - locus.py:1286 - INFO - expand_transcript - LociProcesser-5 - gmap3_TRINITY_DN192120_c7_g1_i1.path1 has now start 29664764, end 29668632.
2019-10-09 16:57:29,018 - Scaffold04:29557752-29800006 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-5 - gmap3_TRINITY_DN192120_c7_g1_i1.path1 is now padded
2019-10-09 16:57:29,155 - Scaffold04:29557752-29800006 - locus.py:1286 - INFO - expand_transcript - LociProcesser-5 - gmap2_ofu_40702.5.path1 has now start 29560143, end 29592540.
2019-10-09 16:57:29,157 - Scaffold04:29557752-29800006 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-5 - gmap2_ofu_40702.5.path1 is now padded
2019-10-09 16:57:30,012 - Scaffold03:36653287-36861509 - locus.py:1286 - INFO - expand_transcript - LociProcesser-9 - gmap1_TRINITY_DN116343_c1_g1_i1.path2 has now start 36826171, end 36853449.
2019-10-09 16:57:30,023 - Scaffold03:36653287-36861509 - locus.py:1286 - INFO - expand_transcript - LociProcesser-9 - gmap3_TRINITY_DN202163_c0_g1_i3.path1 has now start 36826171, end 36853449.
2019-10-09 16:57:30,029 - Scaffold03:36653287-36861509 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-9 - gmap3_TRINITY_DN202163_c0_g1_i3.path1 is now padded
2019-10-09 16:57:30,030 - Scaffold03:36653287-36861509 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-9 - gmap1_TRINITY_DN116343_c1_g1_i1.path2 is now padded
lucventurini commented 4 years ago

Dear @ChemaMD, no, it is not normal. Mikado should have been dumping the transcripts to the temp_store.db file pretty much continuously. The only reason it might be stalling is if it has found an extremely dense locus (in the order of tens of thousands of transcripts or more), which would slow down the transmission of data to the child processes. If that is the case, it should get unstuck eventually. If it does not, then I think I would really need to have a look at the data myself. As I mentioned earlier, I would treat it with absolute confidentiality and I would be ready to sign an NDA after consulting with the relevant department at my institution (or we could, potentially, set up a collaboration - whatever would work best).

Kind regards

ChemaMD commented 4 years ago

Dear @lucventurini , I started two approaches to overcome this issue and test whether the problem is a complex locus stalling mikado pick. For the first, I launched a job with the entire dataset and 27 allocated cores. For the second, I split the original dataset by scaffold (30 jobs) and launched mikado pick with 3 cores for each subset, which appeared to work well in one of the tests we did yesterday. The second approach is running, and it processed the whole of Scaffold01 quite quickly, but it gave an error at the merging step:

===== LOG ====

2019-10-10 10:41:58,907 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared_Scaffold1.gtf
2019-10-10 10:41:58,907 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 3 --json-conf configuration.yaml --output-dir Scaffold1 mikado_prepared_Scaffold1.gtf
2019-10-10 10:41:58,908 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-10 10:41:59,044 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-10 10:41:59,442 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold01
2019-10-10 10:42:09,653 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold01
2019-10-10 10:42:10,956 - Scaffold01:21107-135422 - locus.py:1286 - INFO - expand_transcript - LociProcesser-3 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 has now start 93847, end 132507.
2019-10-10 10:42:10,959 - Scaffold01:21107-135422 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-3 - gmap1_TRINITY_DN103619_c1_g2_i21.path1 is now padded
2019-10-10 10:45:00,606 - listener - picker.py:743 - INFO - __submit_multi_threading - MainProcess - Joined children processes; starting to merge partial files
2019-10-10 10:45:00,613 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-10-10 10:45:00,613 - main - __init__.py:121 - ERROR - main - MainProcess - {10}
Traceback (most recent call last):
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
    args.func(args)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/subprograms/pick.py", line 199, in pick
    creator()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1078, in __call__
    self._parse_and_submit_input()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1044, in _parse_and_submit_input
    self.__submit_multi_threading()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/picker.py", line 750, in __submit_multi_threading
    tempdir=tempdir)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/loci_processer.py", line 79, in merge_loci
    set.difference(set(range(1, max(common_index.keys()) + 1)), set(common_index.keys()))
AssertionError: {10}
lucventurini commented 4 years ago

Dear @ChemaMD, thank you, this looks like a proper bug. I will have a look at the relevant code now.

Regarding finding the locus that is creating the problem - I have thought of a way.

1- Convert the prepared GTF into a BED12, putting "1" as a score:

mikado util convert -of bed12 mikado_prepared.gtf | awk -F "\t" '{OFS="\t"; $5=1; print $0}' > mikado_prepared.bed12

2- Use BEDtools to merge the intervals and find the culprit:

bedtools merge -c 5 -d 200 -i mikado_prepared.bed12 -o sum > merged.bed

Please change -d 200 to the value assigned to the pick/clustering/flank value inside the Mikado configuration (default should be 200).

This should already give us an indication of which loci are massive: it will produce a BED file where the fourth column is the number of transcripts per locus. This should lead us fairly quickly to the culprit.
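Assuming merged.bed was produced as above (fourth column = number of transcripts per merged interval), the densest candidate loci can then be listed first with, for example:

sort -t $'\t' -k4,4nr merged.bed | head -n 20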

Another possible solution, and this could definitely make a big impact, is to change the parameter prepare/max_intron_length. By default, mikado prepare is quite lenient and will keep anything with an intron of 1 Mbp or less. However, a single transcript with an intron of that size might end up chaining together tens of thousands of transcripts, hence blocking the program. A much smaller value, such as 50,000, might solve your problem by throwing out transcripts that are a) implausible and b) creating loci too big for Mikado to swallow.
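As a sketch of how to apply this (the parameter path is as named above; the exact layout of your configuration.yaml is assumed):

grep -n "max_intron_length" configuration.yaml
# edit the value found under the "prepare:" section, e.g. max_intron_length: 50000,
# then re-run mikado prepare (and regenerate its downstream inputs) before pick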

I hope this helps.

lucventurini commented 4 years ago

PS: @swarbred, do you think that reducing the default value for the maximum intron length (1,000,000) would be a good idea? It seems that keeping it this high is creating problems. I know that there are known introns in the H. sapiens annotation that are as long as 4 megabasepairs, but even there, 99% of known introns are shorter than 100,000 bp (more precisely, under 87,000 bp). Maybe putting the default limit at 100,000 would be a saner option?

swarbred commented 4 years ago

@lucventurini

I would be wary of changing this to 100,000. While there are relatively few transcripts with an intron larger than this (~7,000 in the human GENCODE v29 annotation), excluding them in the prepare step removes any chance of annotating them, e.g.:

http://www.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000011105;r=12:3077379-3286559;t=ENST00000011898

The HISAT2 default is 500,000, and there are only 105 transcripts above this, so that is probably a better choice if you were to make a change.

We have run lots of rodent genomes through with no issues (with the default 1,000,000, but based on HISAT2 alignments with a maximum intron size of 500,000).

lucventurini commented 4 years ago

@swarbred

Good point. Thank you for the comment.

I think that in general 500,000 might be a saner option, thinking back to some pathological cases. But I will not pull the trigger yet.

ChemaMD commented 4 years ago

@lucventurini ok, I started a new mikado run with the lower intron length (50,000; if mikado now runs to completion, I might then consider going up to 500,000 or just 100,000). In my case this is an annelid genome, and there are not many other genomes to use as a reference (the best one would be Capitella, and it is half the size of the genome I'm annotating). From the ~4.8M transcripts I loaded in mikado prepare, ~27,000 were discarded with the default cutoff and ~80,000 with the 50,000 cutoff. That's 1.7% of the loaded transcripts (probably even less in terms of genes), a fairly small proportion. In fact, I checked, and all of them come from Trinity genes mapped to the genome (with GMAP), so these rare cases might just be Trinity misassemblies. In my dataset, ~2/3 of the loaded transcripts (~3.3M) come from mapped RNA-seq reads, so hopefully the genes will still be annotated.

The test running the initial dataset scaffold by scaffold is still ongoing; it has only managed to finish the three smallest scaffolds. In some of them there was an error, I think associated with an invalid codon, which didn't kill the run but meant removing the super locus (log below).

2019-10-10 15:06:56,633 - main_logger - picker.py:350 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared_Scaffold27.gtf
2019-10-10 15:06:56,633 - main_logger - picker.py:352 - INFO - setup_logger - MainProcess - Command line: /data/home/btx333/.conda/envs/mikado2.1/bin/mikado pick --procs 3 --json-conf configuration.yaml --output-dir Scaffold27 mikado_prepared_Scaffold27.gtf
2019-10-10 15:06:56,634 - listener - picker.py:369 - WARNING - setup_logger - MainProcess - Current level for queue: INFO
2019-10-10 15:06:56,961 - listener - picker.py:105 - INFO - __init__ - MainProcess - Random seed: 2834569384
2019-10-10 15:06:57,641 - listener - picker.py:872 - INFO - __parse_multithreaded - MainProcess - Starting chromosome Scaffold27
2019-10-10 15:06:58,102 - listener - picker.py:913 - INFO - __parse_multithreaded - MainProcess - Finished chromosome Scaffold27
2019-10-10 15:20:36,049 - Scaffold27:32517-1419232 - locus.py:822 - ERROR - pad_transcripts - LociProcesser-3 - Codon 'GGg' is invalid
Traceback (most recent call last):
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 145, in _translate_str
    residue = forward_table[codon]
KeyError: 'GGg'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 818, in pad_transcripts
    self.logger)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1244, in expand_transcript
    transcript = enlarge_orfs(transcript, backup, seq, upstream, downstream, logger)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1545, in enlarge_orfs
    orf.expand(seq, upstream, downstream, expand_orf=True, logger=logger)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 1147, in expand
    prot_seq = _translate_str(coding_seq, table=self.table, gap="N")
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 153, in _translate_str
    raise CodonTable.TranslationError("Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon 'GGg' is invalid
2019-10-10 15:20:36,088 - Scaffold27:32517-1419232 - loci_processer.py:318 - ERROR - analyse_locus - LociProcesser-3 - Codon 'GGg' is invalid
Traceback (most recent call last):
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 145, in _translate_str
    residue = forward_table[codon]
KeyError: 'GGg'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/picking/loci_processer.py", line 312, in analyse_locus
    stranded_locus.define_loci()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 1195, in define_loci
    self.define_alternative_splicing()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 1334, in define_alternative_splicing
    self.loci[lid].finalize_alternative_splicing()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 185, in finalize_alternative_splicing
    failed = self.__launch_padding()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 212, in __launch_padding
    templates = self.pad_transcripts()
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 818, in pad_transcripts
    self.logger)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1244, in expand_transcript
    transcript = enlarge_orfs(transcript, backup, seq, upstream, downstream, logger)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/loci/locus.py", line 1545, in enlarge_orfs
    orf.expand(seq, upstream, downstream, expand_orf=True, logger=logger)
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 1147, in expand
    prot_seq = _translate_str(coding_seq, table=self.table, gap="N")
  File "/data/home/btx333/.conda/envs/mikado2.1/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 153, in _translate_str
    raise CodonTable.TranslationError("Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon 'GGg' is invalid
2019-10-10 15:20:36,134 - Scaffold27:32517-1419232 - loci_processer.py:319 - ERROR - analyse_locus - LociProcesser-3 - Removing failed locus superlocus:Scaffold27-:55769-419600
2019-10-10 15:20:37,965 - Scaffold27:32517-1419232 - locus.py:1286 - INFO - expand_transcript - LociProcesser-3 - gmap1_TRINITY_DN127226_c2_g1_i3.path1 has now start 824201, end 834738.
2019-10-10 15:20:37,967 - Scaffold27:32517-1419232 - locus.py:195 - WARNING - finalize_alternative_splicing - LociProcesser-3 - gmap1_TRINITY_DN127226_c2_g1_i3.path1 is now padded
2019-10-10 15:20:42,112 - listener - picker.py:743 - INFO - __submit_multi_threading - MainProcess - Joined children processes; starting to merge partial files
2019-10-10 15:20:42,115 - listener - loci_processer.py:89 - INFO - merge_loci - MainProcess - We have a total of 178 genes
2019-10-10 15:20:42,654 - listener - picker.py:752 - INFO - __submit_multi_threading - MainProcess - Finished merging partial files
2019-10-10 15:20:42,659 - main_logger - picker.py:1087 - INFO - __call__ - MainProcess - Finished analysis of mikado_prepared_Scaffold27.gtf
lucventurini commented 4 years ago

@ChemaMD

From the ~4.8M transcripts I loaded in mikado prepare, ~27,000 were discarded with the default cutoff and ~80,000 with the 50,000 cutoff. That's 1.7% of the loaded transcripts (probably even less in terms of genes), a fairly small proportion. In fact, I checked, and all of them come from Trinity genes mapped to the genome (with GMAP), so these rare cases might just be Trinity misassemblies. In my dataset, ~2/3 of the loaded transcripts (~3.3M) come from mapped RNA-seq reads, so hopefully the genes will still be annotated.

Indeed they might just be Trinity misassemblies, or GMAP misalignments. This would fit the pattern we observed in other datasets.

The test running the initial dataset scaffold by scaffold is still ongoing; it has only managed to finish the three smallest scaffolds. In some of them there was an error, I think associated with an invalid codon, which didn't kill the run but meant removing the super locus (log below).

Yes indeed. This is an oversight on my part, I will fix it immediately. Thank you again for reporting!

EDIT: fixed in a278687f

swarbred commented 4 years ago

@ChemaMD Just to comment: for an annelid genome a 50,000 max intron size should be fine; the ~80,000 transcripts removed will overwhelmingly be alignment artefacts.

ChemaMD commented 4 years ago

Thanks @swarbred and @lucventurini for the feedback and help. The single-scaffold runs are progressively finishing and, based on the size of these scaffolds, it appears that mikado pick is processing roughly ~1.5 Mb of genomic sequence and ~150 genes per hour with 3 CPUs. Is that expected?

lucventurini commented 4 years ago

@ChemaMD, on our A. thaliana test dataset, Mikado did approximately 45 Mbp/hour per CPU. However, that dataset had about 20 times fewer transcripts, at about 200,000.

Your observation is therefore in line with what I would expect - Mikado should scale almost linearly with transcript numbers and transcript density, and that seems to be the case here. Throwing more CPUs at the problem should help, as the task is completely parallelised.

ChemaMD commented 4 years ago

Thanks @lucventurini ! I will continue with the run with the limited intron size and submit mikado pick with a good number of CPUs.

lucventurini commented 4 years ago

Not a problem! Out of curiosity, how big is the genome you are analysing? I had a quick look and the best I could find seemed to indicate a genome size between 250 Mbp and 1.2 Gbp for Annelida. No particular reason, just curious about the organism (and, admittedly, about whether my back-of-the-envelope calculations on Mikado performance are really in the right ballpark).

ChemaMD commented 4 years ago

The genome is ~550 Mb, mid-range for an annelid. We will have some others closer to the 1 Gb upper range. I did the BED12 calculation for Scaffold01 (~44 Mb), and there seems to be a super-locus that spans almost the entire scaffold, with of course hundreds of thousands of transcripts:

Scaffold01      11140   11472   1
Scaffold01      12154   15286   18
Scaffold01      16834   17458   6
Scaffold01      20105   20419   1
Scaffold01      21106   135422  456
Scaffold01      136130  136355  1
Scaffold01      137160  137594  1
Scaffold01      138823  142241  22
Scaffold01      142542  142831  3
Scaffold01      145883  146234  1
Scaffold01      147948  43327084        354690
Scaffold01      43329785        43330031        1
Scaffold01      43330355        43330718        1
Scaffold01      43333150        43334005        3
Scaffold01      43335281        43336324        18
Scaffold01      43336656        43337336        9
Scaffold01      43339720        43340005        1
Scaffold01      43340848        43341056        1
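For reference, the transcripts falling entirely inside that giant interval can be counted directly from the BED12 produced earlier (coordinates taken from the table above; interval boundaries are approximate):

awk -F "\t" '$1 == "Scaffold01" && $2 >= 147948 && $3 <= 43327084' mikado_prepared.bed12 | wc -l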
lucventurini commented 4 years ago

Yep, that would nicely break Mikado's parallelisation - all those hundreds of thousands of transcripts would have been analysed by a single process. I hope that the maximum intron heuristic will solve the issue. Please let us know when the run finishes, or whether you encounter any further issues.

ChemaMD commented 4 years ago

Dear @lucventurini , mikado pick finished without any issues on the run with the maximum intron length set to 50 kb! Thank you very much for all your help!

lucventurini commented 4 years ago

Excellent news! Thank you for letting us know.

I will close the issue, and we can discuss further on Wednesday.