EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Mikado serialize running time #280

Closed. srividya22 closed this issue 4 years ago.

srividya22 commented 4 years ago

Hi ,

I have been running Mikado on the new tomato genome (size: 1.01 Gbp). It finished the mikado prepare step. However, it has been running mikado serialise for quite a long time now. Also, I have not provided any ORFs for this run. I would like to know the estimated time for completion of this step. Also, please let me know whether not providing ORFs could be a problem.

2020-02-18 22:31:08,078 - serialiser - serialise.py:273 - INFO - setup - MainProcess - Command line: /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/mikado serialise --json-conf groundcherry_conf.yaml --xml mikado.blast.xml.gz --blast_targets ../all_proteins_for_mikado_uniq.fasta -p 50 --transcripts mikado_prepared.fasta
2020-02-18 22:31:08,079 - serialiser - serialise.py:288 - INFO - setup - MainProcess - Requested 50 threads, forcing single thread: False
2020-02-18 22:31:08,079 - serialiser - serialise.py:79 - INFO - load_junctions - MainProcess - Starting to load junctions: ['/seq/schatz/sramakri/genome_annotations/groundcherry/RNASeq/STAR/portcullis/3-filt/portcullis_filtered.pass.junctions.bed']
2020-02-18 22:31:40,761 - serialiser - serialise.py:89 - INFO - load_junctions - MainProcess - Loaded junctions
2020-02-18 22:31:40,761 - serialiser - serialise.py:147 - INFO - load_orfs - MainProcess - No ORF data provided, skipping
2020-02-18 22:31:40,761 - serialiser - serialise.py:104 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-02-18 22:32:17,543 - serialiser - xml_serialiser.py:364 - INFO - serialize_targets - MainProcess - Started to serialise the targets
2020-02-18 22:32:30,714 - serialiser - xml_serialiser.py:404 - INFO - serialize_targets - MainProcess - Loaded 642296 objects into the "target" table
2020-02-18 22:32:48,264 - serialiser - xml_serialiser.py:305 - INFO - serialize_queries - MainProcess - Started to serialise the queries
2020-02-18 22:33:07,826 - serialiser - xml_serialiser.py:351 - INFO - serialize_queries - MainProcess - 491431 in queries

lucventurini commented 4 years ago

Dear @srividya22, thank you for reporting this. May I ask which version of Mikado you are running? If you are using the latest official release (1.2.4), note that we have made quite a few improvements since, so installing from the master branch here on GitHub might be a better option.

Even with the newer version, though, a key way to speed up the reading of BLAST results is to split the FASTA file into chunks, execute BLAST on each chunk, and then have Mikado serialise read the different XML files in parallel. This is because reading XML files is an inherently slow process. From what I am seeing, however, I think you have probably run BLAST on the file as a whole, without chunking. In that case Mikado serialise would indeed be very slow.

With chunking (say, into 20 or 30 chunks), mikado serialise should in my experience finish in two or three hours at most.
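For illustration, the chunking step could be sketched as below. This is a minimal sketch, not Mikado's or Daijin's own code: it assumes Biopython is available, and the file and chunk names are placeholders.

# Split mikado_prepared.fasta into N roughly equal chunks; BLAST can then be run
# on each chunk separately and all the resulting XML files handed together to
# "mikado serialise", which will read them in parallel.
from Bio import SeqIO

def chunk_fasta(fasta, n_chunks, prefix="chunk"):
    records = list(SeqIO.parse(fasta, "fasta"))
    size = len(records) // n_chunks + 1
    for idx in range(n_chunks):
        chunk = records[idx * size:(idx + 1) * size]
        if not chunk:
            break
        SeqIO.write(chunk, "{}_{:03d}.fasta".format(prefix, idx + 1), "fasta")

# e.g. chunk_fasta("mikado_prepared.fasta", 20)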

Also, please let me know whether not providing ORFs could be a problem.

Yes, in general this would be a problem. As you have ~half a million transcripts, I would say that either TransDecoder or Prodigal would be appropriate.

We do have a Snakemake-based pipeline, Daijin, which automates these steps and is included in Mikado (documentation here; code here). Hopefully it can provide some guidance on how to run Mikado for your project.

lucventurini commented 4 years ago

Update: the main problem is in the function that performs the nucleotide-to-protein translation; it is by far the bottleneck when loading ORFs.

I am looking for ways to improve its speed (e.g. with a drop-in C extension). Hopefully this will massively increase the speed of serialise.
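For context, the kind of hot loop involved looks roughly like the sketch below; this is purely illustrative and not Mikado's actual function, with the codon table taken from Biopython.

# Naive codon-by-codon translation: the per-codon dictionary lookups in pure
# Python are exactly the kind of cost a drop-in C extension would remove.
from Bio.Data import CodonTable

TABLE = CodonTable.unambiguous_dna_by_id[1].forward_table  # standard genetic code

def translate(cds):
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3].upper()
        protein.append(TABLE.get(codon, "*"))  # stop/ambiguous codons as "*"
    return "".join(protein)

print(translate("ATGGCCTAA"))  # -> "MA*"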

lucventurini commented 4 years ago

@srividya22 @swarbred

I have created a new branch, issue-280, where I am making the necessary improvements. The first two commits there should already provide a great speed-up compared to the current version of the code.

By speed-up I mean a ~30% increase in performance in single-threaded mode, and an almost linear scaling with multiple threads; the current master version is actually slower in multiprocessing than in single-threaded mode.

srividya22 commented 4 years ago

Thanks @lucventurini. Do you recommend installing the code from the issue-280 branch for the speed-up?

lucventurini commented 4 years ago

@srividya22

I would, yes. If you could also report back to us whether the improvement is present on your system as well, we would be very grateful.

Kind regards

srividya22 commented 4 years ago

Sure, I will do that. Also, as you suggested, I have split the blastx XML files (using a Java tool) for the speed-up. However, I am not sure whether I should also split the mikado_prepared.fasta file and the ORFs BED file from TransDecoder for speed. Kindly comment on that.

I have been running TransDecoder v5.5.0, but TransDecoder.Predict seems to fail with mikado_prepared.fasta at the step below:

Error, no sequence for CLASS2_SRR7066585_Super-Scaffold_1.0.0 at gff3_file_to_proteins.pl line 81.
Error, cmd: gff3_file_to_proteins.pl --gff3 longest_orfs.pep.transdecoder.gff3 --fasta transdecoder_out/longest_orfs.pep --genetic_code Universal > longest_orfs.pep.transdecoder.pep died with ret 3072 No such file or directory at transdecoder/PerlLib/Pipeliner.pm line 185.
Pipeliner::run(Pipeliner=HASH(0x2b2dc41a9dd0)) called at /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/TransDecoder.Predict line 379

The corresponding sequence ID in longest_orfs.pep is "CLASS2_SRR7066585_Super-Scaffold_1.0.0.p1", so the two are different. It looks like an error in parsing the IDs from the GFF3.

lucventurini commented 4 years ago

@srividya22

However, I am not sure whether I should also split the mikado_prepared.fasta file and the ORFs BED file from TransDecoder for speed. Kindly comment on that.

No, mikado will take care of parallelising that (as BED files are much easier to parse and process than XML files).

I have been running TransDecoder v5.5.0, but TransDecoder.Predict seems to fail with mikado_prepared.fasta at the step below.

Unfortunately I am not an expert on the TransDecoder code. I do know, though, that its internal pipeline manager can get confused if a run gets cancelled and restarted afterwards. If that has happened, I would recommend deleting the whole output subfolder and restarting from TransDecoder.LongOrfs.

srividya22 commented 4 years ago

Hi @lucventurini

As suggested, I have built the version from the issue-280 branch of the repository. I ran Prodigal instead of TransDecoder for this test. The ORFs got loaded, but when loading the BLAST XML I am seeing errors in the log like:

File "blast_serializer/utils.py", line 171, in prepare_hit len(positives) * qmulti, q_aligned)) ValueError: Number of identical positions (2451) greater than number of aligned positions (2423)!

lucventurini commented 4 years ago

@srividya22

That error comes from the parsing of the BLAST files. However, there is something strange: that error is NOT raised at line 171 in the current code, but rather at line 185. Apologies for asking, but would you please be able to double-check the installation of Mikado? It looks like a different, older version might be installed.

Edit: I checked, and it looks like you have installed and run a version that is quite old and still contains bug #150. Would you please be able to check whether that is the case?

Kind regards

lucventurini commented 4 years ago

@srividya22 @swarbred

I am also working on making the BLAST XML serialisation faster. I will track progress on this here. There are two bottlenecks:

lucventurini commented 4 years ago

@swarbred

I have created another branch with the improvements for the XML serialisation, issue-280-dev. This branch can be tested tomorrow after we confirm that the ORFs are now serialised in a reasonable time.

In this new branch, I have more than halved the time needed to process a single HSP. Unfortunately, moving to the new version of BioPython XML parsing actually slows down operations. The result should still be a significant speed-up compared to the current version.

srividya22 commented 4 years ago

Hi @lucventurini

I see you have made some changes since yesterday. Let me know when would be a good time to rebuild and test the new changes in the issue-280-dev branch.

lucventurini commented 4 years ago

Hi @srividya22

I am now done with the changes. I would really appreciate it if you could test the current version and let us know how the changes affect performance on your end!

@swarbred confirmed that the ORF loading is much faster on his dataset with changes I made two days ago (from ~36 hours to ~20 minutes, not a typo). Changes made between yesterday and today affect the loading of BLAST XML files (performance untested).

Thank you

Luca Venturini

srividya22 commented 4 years ago

Thanks for the update. I will rebuild and run it now. Also, I have been getting an error in parsing the YAML file, saying that "pad" is a required property under the alternative splicing section; should I set it to true or false by default?

lucventurini commented 4 years ago

Thanks for the update. I will rebuild and run it now. Also, I have been getting an error in parsing the YAML file, saying that "pad" is a required property under the alternative splicing section; should I set it to true or false by default?

The default is currently true. We are working on the documentation, so please ignore the version present online for the time being.

lucventurini commented 4 years ago

Briefly, padding tries to enlarge transcripts using information derived from other transcripts present in the same locus. Given e.g. three transcripts that are valid alternative splicing events of one another, Mikado will try to give them the same terminal ends, potentially also adding new exons.

This has the effect of obtaining more complete transcripts from fragmentary data, e.g. cDNAs or RNA-Seq assemblies. The process is similar (but not identical in important details, e.g. Mikado will never create retained introns) to the process described for AtRTD2.

The disadvantage is that this is nonetheless an inference.

Whether padding makes sense for your case depends on your data. If you have RNASeq, I would say it's a sane default.

srividya22 commented 4 years ago

Thank you. If you could post the most current YAML file for Mikado that you use for testing, that would be helpful. I have uploaded mine here; if you see anything glaring, just let me know. Thanks a lot for all the help here in running Mikado for our project.

groundcherry_conf.yaml.gz

My command line is

- ${all_blast_xml} is a list of xml.gz files from the chunked FASTA
- ${transcripts} is the mikado_prepared.fasta file
- I was confused about what mode to specify; just using nosplit for now

mikado serialise --xml=${all_blast_xml} --blast_targets=${ref_proteins} --start-method=spawn \
    --transcripts=${transcripts} --genome_fai=${fai} --json-conf=${yaml} \
    --force --orfs=${orfs} -od ${out_dir} --procs=20 -l mikado_serialize.log

mikado pick --source Mikado_nosplit --mode=nosplit \
    --procs=20 --start-method=spawn --json-conf=${yaml} -od ${out_dir} -l mikado_pick.log \
    --loci-out ${mikado_output} -lv INFO -db mikado.db mikado_prepared.gtf

I took it from the Daijin pipeline code.

lucventurini commented 4 years ago

Dear @srividya22

It looks like this file was generated using Mikado v1.2.4. Would you please be able to regenerate it using the Mikado version from the latest branch? There have been many changes for the 2.x release and the configuration files are not compatible between versions. This is why you were getting the error regarding the pad property.

I was confused about what mode to specify.

We do explain the splitting mechanism here. The default is permissive, but @swarbred and I plan to change it to stringent for Mikado 2. In your configuration file it is currently set to PERMISSIVE (line 54), although of course the command-line switch will enforce the nosplit mode.

Please let me know whether the documentation regarding splitting is confusing in any way, so that we can clarify it where needed.

swarbred commented 4 years ago

@lucventurini Not seeing any XML loading improvement:
previous run: 21 mins for ORF loading, 3 hrs 29 mins for XML
new run: 22 mins for ORF loading, 5 hrs 18 mins for XML

lucventurini commented 4 years ago

@lucventurini Not seeing any XML loading improvement:
previous run: 21 mins for ORF loading, 3 hrs 29 mins for XML
new run: 22 mins for ORF loading, 5 hrs 18 mins for XML

Shame :-( OK I will have to have a second look.

swarbred commented 4 years ago

@lucventurini What's the advantage of loading from XML? Would BLAST tabular format be faster to load? Could we support both?

lucventurini commented 4 years ago

@lucventurini What's the advantage of loading from XML? Would BLAST tabular format be faster to load? Could we support both?

So the main advantages of XML are:

The first three points are a bit of a hassle but nothing major or really blocking.

The last point is quite crucial, because without the sequence it is impossible to calculate meaningful coverage/identity for a target when there are multiple HSPs. That is why there is a specific function (the most time-consuming of the whole serialise process) that calculates which positions are identities and which are positives, and then makes sure that there are no redundant counts. If we were to dispose of this, the only viable alternative would be to rely on the best HSP's coverage/identity for each hit.

This would affect directly three metrics: blast_coverage, blast_identity and snowy_blast_score.

More generally, it would also affect the structure of the Mikado database (as of course there are places where I consider the alignment strings, etc.).

However, profiling clearly shows that parsing the XMLs on one hand, and parsing the alignment strings on the other, easily take up ~75% of the time of mikado serialise for BLAST hits. So there is definitely a case for ditching what is written above and just using the less powerful tabular format. This is not just a problem for us: e.g. MEGAN saw its performance increase (by a lot) after starting to support tabular rather than just XML files.
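To make the redundancy point concrete, here is a toy sketch (not Mikado's code) of why per-position bookkeeping is needed when a hit has several HSPs: coverage has to be computed on the union of the aligned query ranges, otherwise overlapping HSPs are counted twice.

# Non-redundant query coverage from several HSPs.
# Each HSP is given as a (query_start, query_end) pair, 1-based and inclusive.
def query_coverage(hsps, query_length):
    covered = set()
    for start, end in hsps:
        covered.update(range(start, end + 1))
    return len(covered) / query_length

# Two overlapping HSPs: naive summing would give (300 + 250) / 500 = 110%,
# whereas the union gives the correct value.
print(query_coverage([(1, 300), (251, 500)], 500))  # -> 1.0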

ljyanesm commented 4 years ago

Is there any combination of parameters in https://www.ncbi.nlm.nih.gov/books/NBK279684/#appendices.Options_for_the_commandline_a that might give the output required?

It seems like the format for options 6, 7 and 10 can be extended with many values; worth having a look...

lucventurini commented 4 years ago

@ljyanesm

I had not had a look in a long while and yes, I think you are totally right! I am having a look now. Specifically, the qseq and sseq keywords seem to give back roughly what we need.

There are still problems (two glaring ones: BLAST+ returns the nucleotide sequence while DIAMOND returns the translated protein for the query, and neither returns the match line which is contained in the XML).

There is the possibility that this could be done properly.

ljyanesm commented 4 years ago

@lucventurini have you checked whether the "btop" option is what you are looking for?

lucventurini commented 4 years ago

@lucventurini have you checked whether the "btop" option is what you are looking for?

It kind of is. I will have to rewrite the parser, but it is doable. Let me see whether I can have something by the end of the day!

ljyanesm commented 4 years ago

This should both improve the parsing speed, out of the sheer reduction of the input size, and allow for parallel parsing: the tabular format permits opening the file at "t" locations, determining the location of the first "\n" character after each, and then launching threads to process chunks of the file independently based on shared, predetermined "start, end" positions.
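A minimal sketch of that chunking idea follows; it is illustrative only (the file name is a placeholder, and this is not Mikado's implementation).

# Compute byte ranges of a tabular BLAST file that each end on a newline, so
# that independent workers can parse their own chunk without overlaps.
import os

def chunk_boundaries(path, n_chunks):
    size = os.path.getsize(path)
    step = max(size // n_chunks, 1)
    boundaries = [0]
    with open(path, "rb") as handle:
        for idx in range(1, n_chunks):
            handle.seek(idx * step)
            handle.readline()  # advance to just past the next newline
            boundaries.append(min(handle.tell(), size))
    boundaries.append(size)
    # Each worker then reads and parses path[start:end] line by line.
    return list(zip(boundaries[:-1], boundaries[1:]))

# e.g. chunk_boundaries("mikado_prepared_matches.tsv", 32)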

lucventurini commented 4 years ago

This should both improve the parsing speed, out of the sheer reduction of the input size, and allow for parallel parsing: the tabular format permits opening the file at "t" locations, determining the location of the first "\n" character after each, and then launching threads to process chunks of the file independently based on shared, predetermined "start, end" positions.

Yep, I am using a similar strategy both for the ORFs in mikado serialise and for mikado pick. From what I am seeing, the biggest hurdle is parsing the btop line appropriately.

lucventurini commented 4 years ago

This should both improve the parsing speed, out of the sheer reduction of the input size, and allow for parallel parsing: the tabular format permits opening the file at "t" locations, determining the location of the first "\n" character after each, and then launching threads to process chunks of the file independently based on shared, predetermined "start, end" positions.

To be precise: parsing the input with pandas takes seconds for whole files, at most.

lucventurini commented 4 years ago

@ljyanesm @swarbred

This snippet seems to function:

import re
from fastnumbers import fast_int, isint
from Bio.SubsMat.MatrixInfo import blosum62

b62 = dict()
for key, val in blosum62.items():
    # MatrixInfo only stores one key order; as the matrix is symmetrical,
    # store both "AB" and "BA" so that btop pairs can be looked up directly.
    b62["".join(key)] = val
    b62["".join(key[::-1])] = val

qpos = 0
ids = set()    # query positions that are identical matches
poses = set()  # query positions that are positive (conservative) matches
qmult = 3      # blastx/diamond blastx: one protein residue = three query bases

# "nrec" is assumed to be a pandas DataFrame holding the tabular BLAST results,
# so nrec.btop.values[0] is the btop string of the first HSP.
for pos in re.findall(r"(\d+|[A-Z|-]{2,2})", nrec.btop.values[0]):
    if isint(pos):  # a run of identical residues
        pos = fast_int(pos)
        ids.update(set(range(qpos, qpos + pos * qmult)))
        poses.update(set(range(qpos, qpos + pos * qmult)))
        qpos += pos * qmult
    elif pos[0] == "-":  # Gap in query: no query position is consumed
        continue
    elif pos[1] == "-":  # Gap in target: query advances, no match
        qpos += qmult
    else:  # mismatch: positive if the substitution scores > 0 in BLOSUM62
        if b62.get(pos, -1) > 0:
            poses.update(set(range(qpos, qpos + qmult)))
        qpos += qmult

This creates a correct set of positions that are "identical" and others that are "positive".

PS: note that the correctness of the result relies on using the correct substitution matrix. In most cases this is not a problem (BLOSUM62), but we will anyway have to modify the configuration and pipeline to ensure that the matrix is always the same.
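Downstream, those sets would translate into percentages along these lines (continuing the snippet above; the exact variable and metric names inside Mikado may differ):

# qpos at the end of the loop is the number of query bases consumed by the HSP.
q_aligned = max(qpos, 1)
identity = 100 * len(ids) / q_aligned    # roughly what blast_identity is based on
positive = 100 * len(poses) / q_aligned  # roughly what the positives-based metrics use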

lijing28101 commented 4 years ago

Hi @srividya22, I also need to split my blast.xml file. Could you share your script with me? Thanks, Jing

srividya22 commented 4 years ago

@lijing28101

I just used the biostar165777.jar file from the https://github.com/lindenb/jvarkit repo.

java -jar /seq/schatz/sramakri/sources/jvarkit/dist/biostar165777.jar -o mikado.blast_SPLIT.xml -T Hit -N 20 mikado.blast.xml

Thanks

lucventurini commented 4 years ago

Dear @srividya22 @lijing28101

The master branch now contains all the improvements for the ORF loading. I am keeping the issue open while I continue working on switching to using tabular BLAST data.

Kind regards

lucventurini commented 4 years ago

Hi @swarbred

With 32f3f7d I should have done all the necessary coding for performing the BLAST loading from the tabular files. As I mentioned during the call, the multiprocessing will now be more efficient, as each file will be analysed with full multiprocessing (rather than multiprocessing being used to scan files in parallel, one per process). I do still have to write and run tests, but I am quite confident.

srividya22 commented 4 years ago

@lucventurini

I reran mikado using one of your earlier commits d5f9130ac782050a572c1750562e5abe8b69cc0c

and mikado serialise finished within 5 minutes:

2020-03-07 21:39:29,054 - serialiser - serialise.py:290 - INFO - setup - MainProcess - Command line: /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/mikado serialise --xml=/seq/schatz/sramakri/genome_annotations/groundcherry/mikado/mikado_only_uniprot/chunked_blastx/mikado_chunk_001.xml.gz,/seq/schatz/sramakri/genome_annotations/groundcherry/mikado/mikado_only_uniprot/chunked_blastx/mikado_chunk_002.xml.gz,/seq/schatz/sramakri/genome_annotations/groundcherry/mikado/mikado_only_uniprot/chunked_blastx/mikado_chunk_003.xml.gz --blast_targets=uniprot_sprot.fasta --start-method=spawn --transcripts=mikado_prepared.fasta --genome_fai=G20.bionano.fasta.fai --json-conf=groundcherry_conf_v2.yaml --force --orfs=../new_transdecoder/mikado_prepared.fasta.transdecoder.bed -od ./ --procs=20 -l mikado_serialize.log
2020-03-07 21:39:29,055 - serialiser - serialise.py:296 - INFO - setup - MainProcess - Random seed: 0
2020-03-07 21:39:29,056 - serialiser - serialise.py:334 - INFO - setup - MainProcess - Using a sqlite database (location: ./mikado.db)
2020-03-07 21:39:29,056 - serialiser - serialise.py:338 - INFO - setup - MainProcess - Requested 20 threads, forcing single thread: False
2020-03-07 21:39:29,056 - serialiser - serialise.py:361 - WARNING - serialise - MainProcess - Removing old data from ./mikado.db because force option in place
2020-03-07 21:39:43,329 - serialiser - serialise.py:140 - INFO - load_orfs - MainProcess - Starting to load ORF data
2020-03-07 21:41:02,184 - serialiser - orf.py:477 - INFO - __serialize_multiple_threads - MainProcess - Finished loading 455864 ORFs into the database
2020-03-07 21:41:07,706 - serialiser - serialise.py:152 - INFO - load_orfs - MainProcess - Finished loading ORF data
2020-03-07 21:41:07,854 - serialiser - serialise.py:106 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-03-07 21:41:29,048 - serialiser - xml_serialiser.py:331 - INFO - serialize_targets - MainProcess - Started to serialise the targets
2020-03-07 21:41:40,602 - serialiser - xml_serialiser.py:366 - INFO - serialize_targets - MainProcess - Loaded 558898 objects into the "target" table
2020-03-07 21:41:40,610 - serialiser - xml_serialiser.py:260 - INFO - serialize_queries - MainProcess - Started to serialise the queries
2020-03-07 21:41:40,883 - serialiser - xml_serialiser.py:308 - INFO - serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2020-03-07 21:43:48,589 - serialiser - xml_serialiser.py:531 - INFO - serialize - MainProcess - Loaded 0 alignments for 0 queries
2020-03-07 21:43:48,590 - serialiser - xml_serialiser.py:533 - INFO - serialize - MainProcess - Finished loading blast hits
2020-03-07 21:43:48,764 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data
2020-03-07 21:43:48,765 - serialiser - serialise.py:81 - INFO - load_junctions - MainProcess - Starting to load junctions: ['/seq/schatz/sramakri/genome_annotations/groundcherry/RNASeq/STAR/portcullis/3-filt/portcullis_filtered.pass.junctions.bed']
2020-03-07 21:44:44,993 - serialiser - serialise.py:91 - INFO - load_junctions - MainProcess - Loaded junctions
2020-03-07 21:44:44,993 - serialiser - serialise.py:383 - INFO - serialise - MainProcess - Finished

However, it has been stuck at mikado pick for the past couple of days, and in the output I see this error in STDERR:

Traceback (most recent call last):
  File "/sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/lib/python3.6/site-packages/Mikado/picking/loci_processer.py", line 506, in run
    definition = GtfLine(tjson["definition"]).as_dict()
  File "/sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/lib/python3.6/site-packages/Mikado/parsers/GTF.py", line 46, in __init__
    super().__init__(line, my_line, header=header)
  File "/sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/lib/python3.6/site-packages/Mikado/parsers/gfannotation.py", line 95, in __init__
    raise ValueError("Invalid start and end values: {}".format(" ".join(self._fields[3:5])))
ValueError: Invalid start and end values: 6946277 6954031
Process LociProcesser-15:
Traceback (most recent call last):
  File "/sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/lib/python3.6/site-packages/Mikado/parsers/gfannotation.py", line 93, in __init__
    self.start, self.end = tuple(fast_int(i, default=None) for i in self._fields[3:5])
  File "/sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/lib/python3.6/site-packages/Mikado/parsers/gfannotation.py", line 93, in <genexpr>
    self.start, self.end = tuple(fast_int(i, default=None) for i in self._fields[3:5])
SystemError: returned NULL without setting an error

lucventurini commented 4 years ago

@swarbred @ljyanesm

I have now finished implementing the tabular BLAST loading. daijin mikado will now use the tabular output by default rather than XML.

Many thanks to @ljyanesm for finding the right way to tackle this issue!

swarbred commented 4 years ago

Hi @lucventurini Can you clarify what tabular output is required for Mikado, i.e. what specifically do we need blast/diamond to generate, given that it's not the default -6 output? How robust is this, e.g. does requesting additional columns mess this up?

You mentioned you still need to test 32f3f7d let me know when you are satisfied and I can run this on the test regions.

lucventurini commented 4 years ago

what specifically do we need blast/diamond to generate, given that it's not the default -6 output? How robust is this, e.g. does requesting additional columns mess this up?

I will specify the columns when I am satisfied with the speed (not there yet). And yes, additional columns or a different order will mess this up; unfortunately that is a side effect of using the tabular format.

lucventurini commented 4 years ago

Hi @swarbred

Hi @lucventurini Can you clarify what tabular output is required for Mikado, i.e. what specifically do we need blast/diamond to generate, given that it's not the default -6 output? How robust is this, e.g. does requesting additional columns mess this up?

You mentioned you still need to test 32f3f7d let me know when you are satisfied and I can run this on the test regions.

Thanks to the big help of @ljyanesm today, we have sped up the procedure considerably.

The fields required are as follows (they might change in the future):

6 qseqid sseqid pident ppos length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq btop

Commit f97a3fb49b557d991a5714da329133d07d17201e should be testable on the regions.

lucventurini commented 4 years ago

@swarbred please use fe3714f5 instead; I have just updated the function to make it ~5X faster. I will now stop working on this branch, to let you test on the regions, as I think the performance is now sufficient.

lucventurini commented 4 years ago

According to bbuchfink/diamond/issues/334, the off-by-one error was in the XML files, not in the tabular format. Amending daijin to reflect this.

lucventurini commented 4 years ago

@swarbred @gemygk @ljyanesm

f7f086d should solve all outstanding issues for this branch. If you could check so that we can merge, that would be great.

@lijing28101 @srividya22 Thank you for your patience on this. @lijing28101 , @ljyanesm and I are now working on implementing intron redundancy checks in mikado prepare (see #270). Once these two issues are resolved, hopefully your (very large!) dataset can be dealt with properly. Apologies for the delayed responses but as you can imagine the coronavirus is impacting us all here in the UK.

swarbred commented 4 years ago

@lucventurini query

  -p PROCS, --procs PROCS
                        Number of threads to use for analysing the BLAST
                        files. This number should not be higher than the total
                        number of XML files.

is this still true with the tsv format? Presumably this no longer needs to be the case.

I ask as the scaffold 6 test region is taking longer than the single XML file took to load.

lucventurini commented 4 years ago

Hi @swarbred, no, you can use as many processes as you want; it is not tied to the number of files. When you say that it's taking longer, do you mean that it finishes, or has it hung/crashed?

Also, how big are the tabular files? They really should not take much time at all for a small-ish region.

lucventurini commented 4 years ago

@swarbred

Commit should solve the issue. I have reverted to the model of analysing one file per process, as the bottleneck was transferring the data across processes. So, as with the XMLs, the ideal is to generate at least as many chunks as the number of processes one wants to launch.

I was able to load 4,384,915 HSPs in 3,955,985 hits, across 20 files, in ~10 minutes. Memory usage was ~3GB; it increases linearly with the number of files, but it is still very low. The main bottleneck at the moment is writing the results into the SQLite database.

JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
21047.1|21047.1|bash||55802304K|nhm-hpc-001|0|55802304K|18388056K|nhm-hpc-001|0|18388056K|0|nhm-hpc-001|0|0|00:52:53|nhm-hpc-001|0|00:03:56|1|21|00:10:42|COMPLETED|0:0|784K|Unknown|Unknown|Unknown|2Gn|0|4186.07M|nhm-hpc-001|0|2833.50M|2333.34M|nhm-hpc-001|0|1524.14M||||cpu=21,mem=0,node=1|cpu=00:03:56,energy=0,fs/disk=2971144967,mem=18388056K,pages=0,vmem=55802304K|cpu=00:52:53,energy=0,fs/disk=4389412775,mem=18388056K,pages=0,vmem=55802304K|cpu=nhm-hpc-001,energy=nhm-hpc-001,fs/disk=nhm-hpc-001,mem=nhm-hpc-001,pages=nhm-hpc-001,vmem=nhm-hpc-001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:52:53,energy=0,fs/disk=4389412775,mem=18388056K,pages=0,vmem=55802304K|cpu=nhm-hpc-001,energy=nhm-hpc-001,fs/disk=nhm-hpc-001,mem=nhm-hpc-001,pages=nhm-hpc-001,vmem=nhm-hpc-001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:03:56,energy=0,fs/disk=2971144967,mem=18388056K,pages=0,vmem=55802304K|energy=0,fs/disk=2446689300|energy=nhm-hpc-001,fs/disk=nhm-hpc-001|fs/disk=0|energy=0,fs/disk=1598177316|energy=0,fs/disk=1598177316

swarbred commented 4 years ago

These were the timings running mikado-2.0rc6_f7f086d_CBG with -p 32, and a second run with -p 1:

**-p32**
2020-03-21 14:20:03,073 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-03-21 14:20:03,081 - serialiser - blast_serialiser.py:79 - INFO - __init__ - MainProcess - Number of dedicated workers: 32
2020-03-21 14:20:07,739 - serialiser - blast_serialiser.py:242 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-03-21 14:20:10,016 - serialiser - blast_serialiser.py:277 - INFO - __serialize_targets - MainProcess - Loaded 127628 objects into the "target" table
2020-03-21 14:20:10,019 - serialiser - blast_serialiser.py:171 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-03-21 14:20:10,061 - serialiser - blast_serialiser.py:219 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2020-03-21 14:20:11,600 - serialiser - tab_serialiser.py:35 - INFO - _serialise_tabular - MainProcess - Creating a pool with 32 workers for analysing BLAST results
2020-03-21 14:20:11,954 - serialiser - tab_serialiser.py:41 - WARNING - _serialise_tabular - MainProcess - Analysing mikado_prepared_matches.tsv
2020-03-21 18:49:57,892 - serialiser - tab_serialiser.py:54 - INFO - _serialise_tabular - MainProcess - Finished loading blast hits
2020-03-21 18:49:58,122 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data
**-p 1**
2020-03-21 17:01:41,141 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-03-21 17:01:41,146 - serialiser - blast_serialiser.py:79 - INFO - __init__ - MainProcess - Number of dedicated workers: 1
2020-03-21 17:01:45,616 - serialiser - blast_serialiser.py:242 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-03-21 17:01:47,601 - serialiser - blast_serialiser.py:277 - INFO - __serialize_targets - MainProcess - Loaded 127628 objects into the "target" table
2020-03-21 17:01:48,610 - serialiser - blast_serialiser.py:171 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-03-21 17:01:48,648 - serialiser - blast_serialiser.py:219 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2020-03-21 17:01:49,414 - serialiser - blast_serialiser.py:226 - INFO - __serialize_queries - MainProcess - 88513 in queries
2020-03-21 17:01:50,957 - serialiser - tab_serialiser.py:41 - WARNING - _serialise_tabular - MainProcess - Analysing mikado_prepared_matches.tsv
2020-03-21 18:33:57,340 - serialiser - tab_serialiser.py:54 - INFO - _serialise_tabular - MainProcess - Finished loading blast hits
2020-03-21 18:33:57,474 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data

These tests are under /ei/workarea/group-ga/Projects/CB-GENANNO-470_Mikado_test_datasets/Analysis/mikado-2.0rc6_f7f086d_CBG/

So in these runs the -p 1 run took ~1 hr 30 mins and the -p 32 run ~4 hrs 30 mins.

I have reverted to the model of analysing one file per process, as the bottleneck was transferring the data across processes. So, as with the XMLs, the ideal is to generate at least as many chunks as the number of processes one wants to launch.

Sorry, I don't follow why we need the user to generate multiple files; I thought part of the reason for moving to tabular was that we could effectively chunk the file internally.

lucventurini commented 4 years ago

Hi @swarbred ,

Fixed in e519ce44.

Script: /ei/workarea/group-ga/Projects/CB-GENANNO-470_Mikado_test_datasets/Analysis/mikado-2.0rc6_f7f086d_CBG/test.sh
Results: /ei/workarea/group-ga/Projects/CB-GENANNO-470_Mikado_test_datasets/Analysis/mikado-2.0rc6_f7f086d_CBG/test/mikado.db
Runtime:

$ sacct -j 26106300 -l -P
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES
26106300|26106300|mikado_seri_test|ei-long||||||||||||||||||32|00:14:06|COMPLETED|0:0||Unknown|Unknown|Unknown|30000Mn|0|||||||||||cpu=32,mem=30000M,node=1|cpu=32,mem=30000M,node=1,billing=32
26106300.batch|26106300.batch|batch||62589928K|t128n19|0|55529736K|15223904K|t128n19|0|8150492K|241K|t128n19|0|241K|01:08:19|t128n19|0|01:08:19|1|32|00:14:06|COMPLETED|0:0|959.92M|0|0|0|30000Mn|0|14034.57M|t128n19|0|14034.57M|3344.19M|t128n19|0|3344.19M||||cpu=32,mem=30000M,node=1
26106300.extern|26106300.extern|extern||178348K|t128n19|0|107956K|1464K|t128n19|0|352K|0|t128n19|0|0|00:00:00|t128n19|0|00:00:00|1|32|00:14:06|COMPLETED|0:0|800M|0|0|0|30000Mn|0|0.00M|t128n19|0|0.00M|0|t128n19|65534|0||||cpu=32,mem=30000M,node=1,billing=32

This took 14 minutes 06 seconds using 32 cores and 15.2 GB of RAM.

I also ran a second test, using in the configuration the new default value of 10^7 for caching objects in memory (instead of 10^5). This version took 2 minutes less (12:11) while using marginally more memory (16.9 GB of RAM).
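For reference, the caching value is the number of objects buffered in memory before being flushed to the database; the general pattern is batched inserts, sketched below with a toy schema (the column names are invented for the example, and this is not Mikado's actual code).

# Buffer rows and flush them to SQLite in large batches: a few big transactions
# are much cheaper than millions of row-by-row inserts.
import sqlite3

MAX_OBJECTS = 10 ** 7  # the new default discussed above; 10 ** 5 was the old one

def serialise(db, hsp_iterator):
    conn = sqlite3.connect(db)
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS hsp (query_id, target_id, score)")
    cache = []
    for row in hsp_iterator:  # each row is a (query_id, target_id, score) tuple
        cache.append(row)
        if len(cache) >= MAX_OBJECTS:
            cursor.executemany("INSERT INTO hsp VALUES (?, ?, ?)", cache)
            conn.commit()
            cache = []
    if cache:
        cursor.executemany("INSERT INTO hsp VALUES (?, ?, ?)", cache)
    conn.commit()
    conn.close()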

Sanity check on the databases:

$ cut -f 1-2 mikado_prepared_matches.tsv | uniq | wc -l
1033816  # Hits
$ wc -l mikado_prepared_matches.tsv
1210499 mikado_prepared_matches.tsv  # HSPs
$ sqlite3 test/mikado.db "select count(*) from hit; select count(*) from hsp"
1033816  # First test, 10^5 max objects
1210499
$ sqlite3 n_test/mikado.db "select count(*) from hit; select count(*) from hsp"
1033816  # Second test, 10^7 max objects
1210499

lijing28101 commented 4 years ago

Hi @lucventurini, I tried my data with the tabular BLAST result, and installed the issue-280 branch. But there is an error when I run serialise. My command is:

mikado serialise --start-method spawn --procs 36 --blast_targets Zm-B73-REFERENCE-NAM-5.0.fa --json-conf configuration.yaml --xml test.tsv --orfs mikado_prepared.fasta.transdecoder.bed -mr 1

The error is

2020-03-23 13:23:07,610 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
2020-03-23 13:23:07,610 - main - __init__.py:121 - ERROR - main - MainProcess - Still working on this
Traceback (most recent call last):
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
    args.func(args)
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 380, in serialise
    load_blast(args, logger)
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 124, in load_blast
    part_launcher(filenames)
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 54, in xml_launcher
    xml_serializer()
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 381, in __call__
    self.serialize()
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 368, in serialize
    self.__serialise_tabular()
  File "/work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 372, in __serialise_tabular
    raise NotImplementedError("Still working on this")
NotImplementedError: Still working on this

The serialise.log has no error message:

2020-03-23 13:21:51,894 - serialiser - serialise.py:290 - INFO - setup - MainProcess - Command line: /work/LAS/mash-lab/jing/bin/anaconda3/envs/mikado2/bin/mikado serialise --start-method spawn --procs 36 --blast_targets Zm-B73-REFERENCE-NAM-5.0.fa --json-conf configuration.yaml --xml test.tsv --orfs mikado_prepared.fasta.transdecoder.bed -mr 1
2020-03-23 13:21:51,894 - serialiser - serialise.py:296 - INFO - setup - MainProcess - Random seed: 0
2020-03-23 13:21:51,897 - serialiser - serialise.py:334 - INFO - setup - MainProcess - Using a sqlite database (location: /work/LAS/mash-lab/jing/maize/mikado_merged/mikado.db)
2020-03-23 13:21:51,897 - serialiser - serialise.py:338 - INFO - setup - MainProcess - Requested 36 threads, forcing single thread: False
2020-03-23 13:21:51,897 - serialiser - serialise.py:140 - INFO - load_orfs - MainProcess - Starting to load ORF data
2020-03-23 13:22:58,236 - serialiser - orf.py:477 - INFO - __serialize_multiple_threads - MainProcess - Finished loading 484706 ORFs into the database
2020-03-23 13:23:01,645 - serialiser - serialise.py:152 - INFO - load_orfs - MainProcess - Finished loading ORF data
2020-03-23 13:23:01,735 - serialiser - serialise.py:106 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-03-23 13:23:07,477 - serialiser - blast_serialiser.py:240 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-03-23 13:23:07,483 - serialiser - blast_serialiser.py:275 - INFO - __serialize_targets - MainProcess - Loaded 685 objects into the "target" table
2020-03-23 13:23:07,484 - serialiser - blast_serialiser.py:169 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-03-23 13:23:07,588 - serialiser - blast_serialiser.py:217 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table

Is my command correct for the tabular BLAST result? My BLAST output format is -outfmt "6 qseqid sseqid pident ppos length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq btop"

lucventurini commented 4 years ago

Hi @lijing28101 Unfortunately you are using an old commit (over one week old), where work on the tabular data had not yet begun. The most recent, and we hope definitive, commit is e519ce4.

To use it, the format of the BLAST tabular output must be the following: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop.
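As an illustration of why the order matters (a sketch, not Mikado's internal reader): the tabular file has no header line, so column names have to be assigned positionally, e.g. with pandas, and any extra column or different order would silently shift every field.

# Read a headerless BLAST/DIAMOND tabular file with the expected column order.
import pandas as pd

BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore",
                 "ppos", "btop"]

hits = pd.read_csv("mikado_prepared_matches.tsv", sep="\t",
                   names=BLAST_COLUMNS, header=None)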

I hope this helps.

swarbred commented 4 years ago

@lucventurini Repeated this on the data mentioned in #263, which gave the following results (before the recent ORF and homology improvements).

The 4 hours 52 was for the serialise (-p 32) alone

breakdown is

~2hrs 30 ORF loading
~2hrs BLAST loading
~8 mins junction loading  

running after the improvements with a pre-chunked file (32)

~9 mins ORF loading
~2 hrs 14 BLAST loading
~12 mins  junction loading 

without pre-chunking

~3 mins ORF loading
~1hr 32 BLAST loading
~ 9 mins junction loading 

So this doesn't seem to show the same speed benefits as you reported above; the diamond file is 15,646,555 lines.