Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

ValueError: could not broadcast input array from shape (0,) into shape (25714,) #354

Open MNTsnowman opened 1 week ago

MNTsnowman commented 1 week ago

Hi Casanovo

This is the first time I'm attempting to use Casanovo. I have tried to follow your guide at https://casanovo.readthedocs.io/en/latest/getting_started.html

I'm getting the error below. I'm wondering whether it could have something to do with the headers of the scans in the mzML files. If that sounds plausible, could you please share the command-line settings you use to generate the mzML files, and how you name and structure the headers?

D:...\De Novo>casanovo sequence -m WorkDir\casanovo_massivekb.ckpt -c WorkDir\casanovo_config.yaml Data\mzML\14-2-NM_S4-A1_1_9156.mzML
WARNING: Dataloader multiprocessing is currently not supported on Windows or MacOS; using only a single thread.
Seed set to 454
INFO: Casanovo version 4.2.1
INFO: Sequencing peptides from:
INFO: Data\mzML\14-2-NM_S4-A1_1_9156.mzML
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
Data\mzML\14-2-NM_S4-A1_1_9156.mzML: 100%|█████████████████████████████████| 27193/27193 [00:32<00:00, 835.91spectra/s]
WARNING: Skipped 25714 spectra with invalid precursor info
Traceback (most recent call last):
  File "C:\Users...\casanovo_env\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users...\casanovo_env\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users...\casanovo_env\Scripts\casanovo.exe\__main__.py", line 7, in <module>
  File "C:\Users...\casanovo_env\lib\site-packages\rich_click\rich_command.py", line 367, in __call__
    return super().__call__(*args, **kwargs)
  File "C:\Users...\casanovo_env\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users...\casanovo_env\lib\site-packages\rich_click\rich_command.py", line 152, in main
    rv = self.invoke(ctx)
  File "C:\Users...\casanovo_env\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users...\casanovo_env\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users...\casanovo_env\lib\site-packages\click\core.py", line 783, in invoke
    return callback(*args, **kwargs)
  File "C:\Users...\casanovo_env\lib\site-packages\casanovo\casanovo.py", line 143, in sequence
    runner.predict(peak_path, output)
  File "C:\Users...\casanovo_env\lib\site-packages\casanovo\denovo\model_runner.py", line 160, in predict
    test_index = self._get_index(peak_path, False, "")
  File "C:\Users...\casanovo_env\lib\site-packages\casanovo\denovo\model_runner.py", line 394, in _get_index
    return Index(index_fname, filenames, valid_charge=valid_charge)
  File "C:\Users...\casanovo_env\lib\site-packages\depthcharge\data\hdf5.py", line 104, in __init__
    self.add_file(ms_file)
  File "C:\Users...\casanovo_env\lib\site-packages\depthcharge\data\hdf5.py", line 195, in add_file
    metadata = self._assemble_metadata(parser)
  File "C:\Users...\casanovo_env\lib\site-packages\depthcharge\data\hdf5.py", line 173, in _assemble_metadata
    metadata["scan_id"] = parser.scan_id
ValueError: could not broadcast input array from shape (0,) into shape (25714,)

bittremieux commented 1 week ago

I suspect that all of the spectra were skipped:

WARNING: Skipped 25714 spectra with invalid precursor info

You already indicated that you suspected something wrong with the scan headers. Did you modify them in some way?

Normally standard mzML files produced by MSConvert, ThermoRawFileParser, etc. should all work. We do not edit the mzML files or the headers in there at all.
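
For readers unfamiliar with the error message itself: it is what NumPy reports when an array of one length is assigned into a preallocated array of a different, incompatible length. The sketch below only illustrates that general mechanism. It is not depthcharge's actual code, and the helper names `assign_scan_ids` and `try_assign` are hypothetical.

```python
import numpy as np


def assign_scan_ids(n_rows, scan_ids):
    """Copy scan_ids into a preallocated array with n_rows entries."""
    target = np.empty(n_rows, dtype=np.int64)
    # Slice assignment needs len(scan_ids) == n_rows (or a length of 1,
    # which NumPy broadcasts). An empty array matches neither case.
    target[:] = scan_ids
    return target


def try_assign(n_rows, scan_ids):
    """Return the error message if the assignment fails, else None."""
    try:
        assign_scan_ids(n_rows, scan_ids)
    except ValueError as err:
        return str(err)
    return None


# An empty list of scan IDs cannot fill 25714 preallocated rows.
print(try_assign(25714, np.array([], dtype=np.int64)))
```

In other words, the traceback suggests that the list of scan IDs ended up empty while the metadata table was still sized for 25714 entries, which fits the suspicion that the precursor information could not be read.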

MNTsnowman commented 1 week ago

Hi @bittremieux

Yes, I suspect the headers, as my data originates from a timsTOF with IM engaged. I don't think the IM is to blame, as it is handled in the conversion (see the command below). Given that the data is from a timsTOF, I don't think ThermoRawFileParser is used at all.

For info, the CMD command I use to generate the mzML files is along these lines:

"C:\Users...\ProteoWizard 3.0.23167.44089af 64-bit\msconvert.exe" --combineIonMobilitySpectra --filter "peakPicking vendor msLevel=1-" --filter "scanSumming precursorTol=0.05 scanTimeTol=5 ionMobilityTol=0.1 sumMs1=0" --filter "titleMaker ... File:"""^<SourcePath^>""", NativeID:"""^<Id^>""""

So, given that it skips all the scans and states that the precursor info is invalid, I was wondering what settings you use to generate the scan title. In other words, what is the "titleMaker" part of your conversion command? I hope this makes sense. Also, please let me know if you have other suggestions for what could be wrong. :)

bittremieux commented 1 week ago

I have limited hands-on experience with timsTOF conversion to mzML, so I don't know how the titleMaker filter should be used. But I'd be surprised if that's the problem. I suspect something about the IM actually.

Can you share the mzML file here to have a look at?

MNTsnowman commented 1 week ago

Unfortunately I'm unable to share a file here. If you have an email address where we could continue the conversation, maybe we could figure something out.

Alternatively I could try to compare the headers of your demo data with my data.
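
One way to make such a comparison concrete is to count, per file, how many MSn spectra carry a precursor m/z and a charge state. The following is a minimal sketch using only the Python standard library. It assumes the MS level is stored as a `cvParam` directly under each `<spectrum>` element and that precursor details sit under `<selectedIon>`; files that use referenceable parameter groups would need extra handling. Which fields Casanovo actually requires is not confirmed here, and the function name `summarize_precursors` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = "{http://psi.hupo.org/ms/mzml}"
MS_LEVEL = "MS:1000511"         # PSI-MS term "ms level"
SELECTED_ION_MZ = "MS:1000744"  # PSI-MS term "selected ion m/z"
CHARGE_STATE = "MS:1000041"     # PSI-MS term "charge state"


def summarize_precursors(source):
    """Count spectra per MS level and MSn spectra lacking precursor fields.

    `source` is a file path or a binary file-like object containing mzML.
    """
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != MZML_NS + "spectrum":
            continue
        # Look for the MS level among the direct children of <spectrum>.
        level = None
        for cv in elem.findall(MZML_NS + "cvParam"):
            if cv.get("accession") == MS_LEVEL:
                level = cv.get("value")
        counts[f"ms_level={level}"] += 1
        if level is not None and level != "1":
            # Collect the accessions annotated on the selected ion(s).
            found = {
                cv.get("accession")
                for ion in elem.iter(MZML_NS + "selectedIon")
                for cv in ion.findall(MZML_NS + "cvParam")
            }
            if SELECTED_ION_MZ not in found:
                counts["msn_missing_precursor_mz"] += 1
            if CHARGE_STATE not in found:
                counts["msn_missing_charge"] += 1
        elem.clear()  # free the memory held by the processed spectrum
    return dict(counts)
```

Calling `summarize_precursors(path)` on both the demo file and the timsTOF file and comparing the two dictionaries would show whether the converted file is missing precursor m/z or charge annotations that the demo data has.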

bittremieux commented 1 week ago

You can email me at wout.bittremieux@uantwerpen.be.