Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0
90 stars 31 forks source link

Enhanced Sequencing Run Logging #343

Open Lilferrit opened 2 weeks ago

Lilferrit commented 2 weeks ago

Implemented additional logging functionality in order to provide an enhanced end of sequence run log. This takes the form of an end of sequencing run report, an example is shown below:

INFO: ======= Sequencing Run Report =======
INFO: Sequencing Run Start Timestamp: 1719004519s
INFO: Sequencing Run End Timestamp: 1719004539s
INFO: Time Elapsed: 20s
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo sequence sample_data\sample_preprocessed_spectra.mgf -o foo.mztab
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Sequencing run date: 06/21/24 14:15:39
INFO: Sequenced 128 spectra
INFO: Sequence Score CMF: {0.0: 96, 0.5: 96, 0.9: 66, 0.95: 60, 0.99: 8}
INFO: Max Sequence Length: 17
INFO: Min Sequence Length: 6
INFO: Max GPU Memory Utilization: 400mb

All the data to generate this report is recorded on the fly as inference is conducted (i.e. no file parsing). To facilitate this the functionality of the data submodule was extended to add infastructure for run time logging. This will hopefully make it relatively easy to further extend logging and other io functionality in the future.

bittremieux commented 1 week ago

Changed base to dev, make sure to reflect that locally.

Lilferrit commented 1 week ago

Added more information to the end of run report, namely the number of skipped spectra. Unfortunately the model never sees the spectra that are skipped due to pre-check errors such as having invalid precursor info, so these aren't reflected in the end of run report. They do however appear at other points in the log. A sample log is available below:

INFO: ======= Sequencing Run Report =======
INFO: Sequencing Run Start Timestamp: 1719271695s
INFO: Sequencing Run End Timestamp: 1719271714s
INFO: Time Elapsed: 18s
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo sequence sample_data\sample_preprocessed_spectra.mgf -o data/foo -c casanovo.yaml
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Sequencing run date: 06/24/24 16:28:34
INFO: Attempted to sequence 128 spectra
INFO: Sequenced 72 spectra
INFO: Skipped 56 spectra
INFO: Sequenced 56.25% of total spectra
INFO: Skipped 43.75% of total spectra
INFO: Score Distribution:
INFO: 54 spectra (75.00%) scored >= 0.0
INFO: 54 spectra (75.00%) scored >= 0.5
INFO: 49 spectra (68.06%) scored >= 0.9
INFO: 46 spectra (63.89%) scored >= 0.95
INFO: 8 spectra (11.11%) scored >= 0.99
INFO: Max Sequence Length: 9
INFO: Min Sequence Length: 6
INFO: Max GPU Memory Utilization: 398mb
codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 92.98246% with 4 lines in your changes missing coverage. Please review.

Project coverage is 90.02%. Comparing base (70ea9fc) to head (4942a48). Report is 12 commits behind head on dev.

Files Patch % Lines
casanovo/utils.py 91.83% 4 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #343 +/- ## ========================================== + Coverage 89.77% 90.02% +0.24% ========================================== Files 12 12 Lines 929 982 +53 ========================================== + Hits 834 884 +50 - Misses 95 98 +3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

bittremieux commented 1 week ago

Can you describe a bit what the logging_io and prediction_io are and how they're intended to be used? You mentioned it during the meeting yesterday, but it would be useful to have it briefly written out as well.

Lilferrit commented 1 week ago

Can you describe a bit what the logging_io and prediction_io are and how they're intended to be used? You mentioned it during the meeting yesterday, but it would be useful to have it briefly written out as well.

For sure, prediction_io contains the definition for the PredictionWriter interface as well as the definition for PredictionMultiWriter. For the most part, all of the functions in the PredictionWriter class were already implemented by MztabWriter. The PredictonWriter interface defines three member functions, log_prediction, log_skipped_spectra, and save which are all optional to implement by a class that implements the interface. The PredictionWriter interface is intended to provide a consistent interface for writing Casanovo sequence predictions to external IO (namely files and loggers).

PredictionMultiWriter implements the PredictionWriter interface, and maintains an internal list of PredictionWriters. When one of the PredictionWriter functions is called on a PredictionMultiWriter it'll call that member function on all of the PredictionWriters in it's internal list.

logger_io contains the definition of LogPredictionWriter, a PredictionWriter that is used to generate the end of sequencing run report. It maintains a table of predictions (which contains only the predicted peptide sequence and the search engine score at the time of writing), and writes to this table when log_prediction is called. When save is called it writes the end of run report to a logger object that is provided via the constructor.

bittremieux commented 1 week ago

Imo this is a bit over-engineered. Instead, we can just log directly where specific information is available using the Python loggers, rather than having to pass it all upstream to these new interfaces. Your implementation is precisely with the Python logger does after all: having a universal interface that multiple writers can hook into and write to different outputs. So we don't need to reinvent that.

So I'd just log the PSM statistics in the MzTabWriter, which is where you have the PSMs. Skipped spectra are logged in the reader, which is where the skipping actually happens (or will happen after the DepthCharge upgrade). (The current spectrum skipping logging is also incorrect and adds non-negligible overhead, which should be avoid for only logging.) Other stuff can similarly be logged in other relevant locations, etc.

Lilferrit commented 1 week ago

Oh ok sounds good - so just to be clear should all of the PSM processing (e.g. the score distribution calculation) and logging happen in the MzTabWriter? Also, should skipped spectra logging just be delayed until the depth charge upgrade is implemented?

bittremieux commented 1 week ago

Yes, I think that's the logical place, considering that all PSMs are aggregated there.

Of course the over-engineering is my personal feeling. We could ask @wfondrie to weigh in as well.

Logging the number of skipped spectra can indeed be delayed for now.

Lilferrit commented 1 week ago

I agree that this solution is overengineered for addressing just the post run logging. My reasoning for implementing it this way is having a PSM io interface would make it relatively easy to extend the PSM io in general. For example if we want to support other output formats like parquet or csv in the future and add a command line option to specify multiple output formats (say you want to write the PSMs to csv and mzTab) it would be relatively easy to do this, or if an end user wants to write PSMs to an external component like a database they could write their own PredictionWriter module to handle this.

bittremieux commented 1 week ago

optimization

Lilferrit commented 1 week ago

Fair enough, I'll get everything moved to the mzTab writer.