MannLabs / alphapeptdeep

Deep learning framework for proteomics
Apache License 2.0

Exporting library results from hdf5 to tsv uses a single thread (and therefore takes 95% of processing time, i.e. ~30 min) #90

Closed gsaxena888 closed 1 year ago

gsaxena888 commented 1 year ago

When exporting library results from HDF to TSV, only one thread is used. As a result, on my 212-core Google Cloud VM, less than 2 minutes is spent on the "real" work (i.e. predicting RT and fragmentation intensity), and about 30 minutes is spent, single-threaded, on the export from HDF to TSV.

Is there any way that export can be done in parallel? (I'm a pure Java programmer with near-zero Python skills, but in Java I've solved a similar problem by running 99% of the export logic in parallel and adding a synchronization statement around the actual file write, since multiple simultaneous writes would corrupt the file. If that's too hard or too risky, is it possible to export n TSV files, where n is the number of threads, with each TSV suffixed somehow with the thread id/number? That has the added advantage that the files are easy to read in parallel as well; see the sketch below.)
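For illustration, a minimal sketch of the n-files variant in Python (hypothetical function names, assuming the library rows sit in a pandas DataFrame; this is not peptdeep's actual API):

import multiprocessing as mp
import numpy as np
import pandas as pd

def _write_chunk(args):
    # Each worker owns its own output file, so no locking is needed.
    chunk_df, out_path = args
    chunk_df.to_csv(out_path, sep='\t', index=False)
    return out_path

def export_tsv_parallel(df: pd.DataFrame, prefix: str, n_workers: int):
    # Split the rows into n roughly equal slices, one per worker.
    chunks = np.array_split(df, n_workers)
    jobs = [(chunk, f'{prefix}.thread{i}.tsv') for i, chunk in enumerate(chunks)]
    with mp.Pool(n_workers) as pool:
        return pool.map(_write_chunk, jobs)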

Thoughts?

(FYI: in case you're interested, I tried reading the HDF file in Java directly, but there appears to be some sort of Python-only convention for the string fields in HDF that causes problems in non-Python languages. I've documented the issue on SO: https://stackoverflow.com/questions/74995561/cant-view-string-fields-in-an-hdf5-file/75034940#75034940 ...It sounds like someone on the HDF/Python/Java side may need to update their libraries, so it'll probably be a long while, which is why I'm hoping it's easy and safe to somehow export the TSV files in parallel...)

jalew188 commented 1 year ago

Yes, writing the TSV to disk is extremely slow, although I already use a dedicated writing process to write the TSV file.
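(For reference, the usual shape of that pattern, as a minimal sketch and not peptdeep's actual implementation: producers enqueue formatted blocks and a single writer process owns the file handle:)

import multiprocessing as mp

def writer(queue, path):
    # Only the writer process touches the file; producers just enqueue blocks.
    with open(path, 'w') as f:
        while True:
            block = queue.get()
            if block is None:  # sentinel marks end of data
                break
            f.write(block)

if __name__ == '__main__':
    queue = mp.Queue()
    proc = mp.Process(target=writer, args=(queue, 'library.tsv'))
    proc.start()
    queue.put('precursor\tfragment\tintensity\n')  # producers put TSV blocks here
    queue.put(None)
    proc.join()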

jalew188 commented 1 year ago

I have updated some information in the SO post, hope it helps.

jalew188 commented 1 year ago

@gsaxena888 I am also curious how other languages can access this HDF data; would you please share here if you have any solutions? I will also check a few things myself, for example C# HDF libraries.

jalew188 commented 1 year ago

@gsaxena888 It may work without AlphaPeptDeep's built-in HDF5 functionalities. We can use either df.to_hdf() or df.to_parquet() to save files into a generic format; then maybe they can be loaded by Java.

gsaxena888 commented 1 year ago

@jalew188 If one could save it to a format like parquet or a more generic HDF format (and assuming it's easy to do), I'd be more than happy to test extraction via Java (I'm fast and can write fairly efficient code in Java).

jalew188 commented 1 year ago

Then you can try this. In Python:

from peptdeep.protein.fasta import PredictSpecLibFasta

lib = PredictSpecLibFasta()
# load_mod_seq=True also loads the peptide sequence/modification string columns
lib.load_hdf('xxx.hdf', load_mod_seq=True)
lib.precursor_df.to_parquet('peptide_df.parquet')
lib.fragment_mz_df.to_parquet('frag_mz_df.parquet')
lib.fragment_intensity_df.to_parquet('frag_inten_df.parquet')

In Java: load the parquet files.

The only problem is that I don't know whether the string arrays are stored as ASCII (1 byte per character) or Unicode (4 bytes per character) arrays.
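(A quick way to check, as a minimal sketch assuming pyarrow is installed; Parquet itself stores strings as UTF-8 byte arrays:)

import pyarrow.parquet as pq

# Print the Arrow schema of the exported file; pandas writes str columns
# as the Arrow 'string' type, which Parquet stores as UTF-8 BYTE_ARRAY.
print(pq.read_schema('peptide_df.parquet'))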

jalew188 commented 1 year ago

@gsaxena888 see PR https://github.com/MannLabs/alphabase/pull/86

gsaxena888 commented 1 year ago

So, the good news is that the code

from peptdeep.protein.fasta import PredictSpecLibFasta

lib = PredictSpecLibFasta()
lib.load_hdf('xxx.hdf', load_mod_seq=True)
lib.precursor_df.to_parquet('peptide_df.parquet')
lib.fragment_mz_df.to_parquet('frag_mz_df.parquet')
lib.fragment_intensity_df.to_parquet('frag_inten_df.parquet')

runs and produces output, at least for a small hdf file.

However, for my "real" HDF file of 5 GB, it runs but produces tiny output files. (The HDF file is 5 GB, but the parquet files are a few megabytes; further, reading the parquet files in Java shows only ~58k peptide rows instead of millions.) There were no warning or error messages. I've attached the HDF file here as a link:

https://drive.google.com/file/d/1qduTxgnBqXe2qrdmwE9md8ZZcoK-TcDI/view?usp=sharing
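(One quick diagnostic, a sketch using h5py directly, to see how many rows each dataset in the file actually holds, independent of peptdeep's loader:)

import h5py

# Walk the HDF5 tree and print each dataset's path and shape, so the
# on-disk row counts can be compared with what load_hdf returned.
with h5py.File('predict.speclib.hdf', 'r') as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))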

(Also, if one renames the file from predict.speclib.hdf to something different, e.g. BIGpredict.speclib.hdf, the Python code throws errors.)

Thoughts?