eeko-kon / pyOpenMS_UmetaFlow

Apache License 2.0

SiriusMSFile #2

Closed oliveralka closed 3 years ago

oliveralka commented 3 years ago

https://github.com/eeko-kon/py4e/blob/master/Workflownew.py#L56

This comes from the SiriusMSFile Class, since you would like to store a .ms file (internally - in memory). https://github.com/OpenMS/OpenMS/blob/develop/src/pyOpenMS/pxds/SiriusMSFile.pxd

python:

Cython signature: void store(MSExperiment & spectra, String & msfile,
FeatureMapping_FeatureToMs2Indices & feature_ms2_spectra_map, bool & feature_only, 
int & isotope_pattern_iterations, bool no_mt_info, 
libcpp_vector[SiriusMSFile_CompoundInfo] v_cmpinfo)

C++:

// write msfile and store the compound information in CompoundInfo Object
vector<SiriusMSFile::CompoundInfo> v_cmpinfo;
bool feature_only = (sirius_algo.getFeatureOnly() == "true") ? true : false;
bool no_mt_info = (sirius_algo.getNoMasstraceInfoIsotopePattern() == "true") ? true : false;
int isotope_pattern_iterations = sirius_algo.getIsotopePatternIterations();
SiriusMSFile::store(spectra,
                        sirius_tmp.getTmpMsFile(),
                        feature_mapping,
                        feature_only,
                        isotope_pattern_iterations,
                        no_mt_info,
                        v_cmpinfo);
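When porting the C++ snippet above to Python, the string-to-boolean conversion could be written as a small helper. This is only a sketch: `param_to_bool` is a hypothetical name (not part of pyOpenMS), and it assumes the parameter value arrives as the text "true" or "false", possibly as a byte string.

```python
def param_to_bool(value):
    """Interpret an OpenMS-style "true"/"false" parameter value as a bool."""
    if isinstance(value, bool):
        return value
    # Assumption: the value may arrive as bytes (b"true") or as str ("true").
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    text = str(value).strip().lower()
    # Stricter than the C++ ternary, which maps everything except "true" to false.
    if text not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {value!r}")
    return text == "true"


feature_only = param_to_bool("true")
no_mt_info = param_to_bool(b"false")
print(feature_only, no_mt_info)
```

Raising on unexpected input is a deliberate choice here, so that a misspelled parameter value does not silently become `False`.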

In general, if you do not know what a parameter does or why it is there, you can also look it up in the documentation: https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/nightly/html/UTILS_SiriusAdapter.html

// here you instantiate a CompoundInfo object, which is used to store additional metadata, which is 
// not parseable after the SIRIUS call anymore.
vector<SiriusMSFile::CompoundInfo> v_cmpinfo;

// this is a parameter called "feature_only".
// It is a boolean value (true/false); if it is true, the feature information
// from in_featureinfo is used to reduce the search space to MS2 spectra associated with a feature.
// This is recommended when working with featureXML input. If you do NOT use it,
// SIRIUS will use every individual MS2 spectrum for estimation (and it will take ages).
bool feature_only = (sirius_algo.getFeatureOnly() == "true") ? true : false;

// This boolean value can lead to discarding the mass trace information from a feature and using isotope_pattern_iterations instead -> so in your case it should be false, since you would like to use the feature information.
bool no_mt_info = (sirius_algo.getNoMasstraceInfoIsotopePattern() == "true") ? true : false;

// this gets the default parameter value of isotope_pattern_iterations. If a feature does not have any information available, the algorithm will try to look for an isotope pattern at C13 distance with a maximum of 3 iterations. Be careful here: if you use too many iterations, you will probably pick up some noise later on.
int isotope_pattern_iterations = sirius_algo.getIsotopePatternIterations();

eeko-kon commented 3 years ago

Amazing, thank you! This is very helpful. I will work on it today and update the workflow. I am having some issues but I'll give it a try to solve them before asking you :)

eeko-kon commented 3 years ago

Alright so this is what I have so far:

Sirius = SiriusMSFile()
argument1 = exp
argument2 = SiriusTemporaryFileSystemObjects.getTmpMsFile()
argument3 = FeatureMapping_FeatureToMs2Indices()
feature_only = True  # SiriusAdapterAlgorithm.getFeatureOnly()==True
Isotope_iter = 3  # SiriusAdapterAlgorithm.getIsotopePatternIterations()
Isotopemasstraceinfo = False
CompoundInfo = []
Sirius.store(exp, argument2, argument3, feature_only, True, 3, False, CompoundInfo)

This gives me the following error:

Traceback (most recent call last):
  File "", line 1, in
  File "pyopenms/pyopenms_6.pyx", line 8113, in pyopenms.pyopenms_6.SiriusMSFile.store
AssertionError: arg msfile wrong type

Which I believe comes from argument2, which requires an argument or just a simpler version. I've simply tried to call "siriustest.ms" or something similar but nothing works so far.

oliveralka commented 3 years ago

Did you figure it out? I will take a look. Which pyopenms version are you using?

edit: Could you please provide the example data you are currently using to test the workflow prototype in the repository? Then I can run it, in the current configuration.

eeko-kon commented 3 years ago

The pyopenms version is 2.5.0. Unfortunately, I have not figured it out yet, and I'm also having some inconvenient issues with my editor, so I'm trying to fix that too. I would expect that a *.ms file would be fine, but somehow it doesn't like that :D

I am using the GermicidinAstandard.mzML from here: https://drive.google.com/drive/folders/1O0JmZa17oqyzObAjphbXxyHmE9LF6Tkf?usp=sharing

eeko-kon commented 3 years ago

Alright, so I think I'm starting to get the idea from looking at the cpp script. Thank you so much for the guidance so far. It really helped a lot!

In case you could help, now I'm having the following issue:

At the store step, I am getting a "Segmentation fault 11 / Core dumped". I also ran the script on our shared machine (big memory) in case it is a storage issue, but the error persists, so it must be one of the arguments that I am passing. I suspect that it is String(sirius_tmp.getTmpDir()), because when I run it on its own I get this error:

String(sirius_tmp.getTmpDir())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __repr__ returned non-string (type bytes)

I don't think this is something a String() class can do. Or is there something I am missing?

oliveralka commented 3 years ago

TypeError: __repr__ returned non-string (type bytes)

There seems to be an error with the type the algorithm is getting. In Python you could use a print statement to check what the current return value is, e.g. print(String(sirius_tmp.getTmpDir()))

It will probably give you a byte string (b' '), which is a type in Python 3; this type can be converted to a regular string.

You can try to decode the byte string via decode('utf-8')

This should then look somewhat like this: String(sirius_tmp.getTmpDir()).decode('utf-8')

Let me know if that works!
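As a plain-Python illustration of the byte string versus regular string distinction: `decode` is a method of Python's built-in `bytes` type, so it only applies once the value really is `bytes`. The helper name `to_text` and the path below are made up for this sketch; it does not cover wrapper objects from pyopenms.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a byte string."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    # Anything else (for example a wrapper object) needs its own conversion.
    raise TypeError(f"cannot convert {type(value).__name__} to str")


raw = b"/tmp/example_dir"  # made-up path standing in for the b'...' output
print(type(raw).__name__)
print(to_text(raw))
```

Raising a `TypeError` for other types makes it obvious when the value is neither `bytes` nor `str`, instead of failing later with a less helpful message.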

eeko-kon commented 3 years ago

Oh yeah that's exactly what it gives me (b' ') !

I tried String(sirius_tmp.getTmpDir()).decode('utf-8')

but I am getting an AttributeError: 'pyopenms.pyopenms_8.String' object has no attribute 'decode'

oliveralka commented 3 years ago

hm, for me it works with String(sirius_tmp.getTmpDir())

code:

# construct sirius ms file object
msfile = SiriusMSFile()

# fill variables, which are used in the function
argument1 = exp 
# TODO: need to construct the feature mapping 
feature_mapping = FeatureMapping_FeatureToMs2Indices() 
feature_only = True #SiriusAdapterAlgorithm.getFeatureOnly()==True
isotope_pattern_iterations = 3
no_mt_info = False
compound_info = [] #SiriusMSFile_CompoundInfo()

# this is a parameter called "feature_only".
# It is a boolean value (true/false); if it is true, the feature information
# from in_featureinfo is used to reduce the search space to MS2 spectra associated with a feature.
# This is recommended when working with featureXML input. If you do NOT use it,
# SIRIUS will use every individual MS2 spectrum for estimation (and it will take ages).
#bool feature_only = (sirius_algo.getFeatureOnly() == "true") ? true : false;
#SiriusAdapterAlgorithm.getNoMasstraceInfoIsotopePattern() == False

print(sirius_tmp.getTmpDir())
print(String(sirius_tmp.getTmpDir()))

msfile.store(exp, 
             String(sirius_tmp.getTmpDir()), # has to be converted to an "OpenMS::String"
             feature_mapping, 
             feature_only,
             isotope_pattern_iterations, 
             no_mt_info, 
             compound_info)

terminal output:

/private/var/folders/t7/x82jn_jd09vc_hlq_sqs2jlw0000gn/T/20210302_173419_Olivers-MBP.fritz.box_18774_1
b'/private/var/folders/t7/x82jn_jd09vc_hlq_sqs2jlw0000gn/T/20210302_173419_Olivers-MBP.fritz.box_18774_1'
.
<ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!> occurred 1081 times
..
Warning: a significant portion of your decharged molecules have gapped, even-numbered charge ladders (42 of 422)This might indicate a too low charge interval being tested.
.
<..> occurred 2 times
No MS1 spectrum for this precursor. Occurred 0 times.
0 spectra were skipped due to precursor charge below -1 and above +1.
Mono charge assumed and set to charge 1 with respect to current polarity 696 times.
0 features were skipped due to feature charge below -1 and above +1.

Currently, the feature_mapping is still missing, which has to be performed in the preprocessing step, that is why you do not get any output yet.

oliveralka commented 3 years ago

Ah ok, you updated the script - let me check the newest version.

eeko-kon commented 3 years ago

It is the same. Don't bother :D Alright, I misunderstood. I thought the feature_mapping was constructed in the preprocessing step

oliveralka commented 3 years ago

No, you did not misunderstand. It is constructed in the preprocessing step.

I was looking at my branch (the one in the PR), and there everything worked with the ".ms" file, based on String(sirius_tmp.getTmpDir()).

I will check your master branch now.

eeko-kon commented 3 years ago

ahh I See :) okok!

oliveralka commented 3 years ago

1) We are using the wrong function :D
String(sirius_tmp.getTmpDir()) <- points to the temporary directory, which is used in the QProcessCall
String(sirius_tmp.getTmpMsFile()) <- is the correct function to get the temporary ms file
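A mix-up between a directory path and a file path can be caught before the value is handed to store(). The check below is a minimal sketch using only the standard library; `looks_like_ms_file` is a hypothetical helper, and the assumption that the temporary ms file path ends in ".ms" is mine, not something confirmed here.

```python
import os
import tempfile


def looks_like_ms_file(path):
    """Return True if path could be used as a .ms output file."""
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    # A directory (such as a temporary working directory) is never a valid target.
    if os.path.isdir(path):
        return False
    # Assumption: the ms file path carries the ".ms" extension.
    return path.endswith(".ms")


with tempfile.TemporaryDirectory() as tmp_dir:
    ms_path = os.path.join(tmp_dir, "example.ms")  # hypothetical file name
    print(looks_like_ms_file(tmp_dir))
    print(looks_like_ms_file(ms_path))
```

The file itself does not need to exist for the check to pass, since it is only written during the store step.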

But it seems I still get a segfault and I am not sure why.

eeko-kon commented 3 years ago

yes, I tried that one too, since that was the one I saw in the cpp, but I am also getting the same output :/ segfault 11

oliveralka commented 3 years ago

Ok, I will try to debug it - this might be an issue with the wrapping between c++ and python. So not much you can do at the moment I guess.

With String(sirius_tmp.getTmpMsFile()) is it the exact same error as you posted above?

eeko-kon commented 3 years ago

No, it doesn't actually! It accepts String(sirius_tmp.getTmpMsFile())

Before you try to debug it: I'm running the script with a different file right now (Leupeptin.mzML) and it has been running for the past 5 min without giving me a segfault! I didn't add any print command so I'm not sure where it is right now, but definitely after the deconvolution step! I'll let you know when I get an output.

oliveralka commented 3 years ago

hm, ok that is good news actually - did you change anything else? Could you update your current master branch?

eeko-kon commented 3 years ago

yes, I updated it! No, nothing else. Only the file! It's so annoying, it's the 2nd time this happens this month.

I think the mzML files are just not very consistent. Not sure what is going on exactly, but when I convert it through my bash shell I am getting seg faults in pyopenms. When I convert it through the UI (proteowizard) I am not usually getting this error, until today.

oliveralka commented 3 years ago

Hm, ok - great that it works with the "Leupeptin.mzML", so you can proceed with your development of the pyopenms pipeline!

I think I will still take a look at the C++ side when I have time with the other mzML, since if the program segfaults, there is usually something wrong. For example, an edge case that is not handled correctly.

eeko-kon commented 3 years ago

Great, I will let you know how it goes! For now, it's still not even preprocessed yet.

oliveralka commented 3 years ago

> I think the mzML files are just not very consistent. Not sure what is going on exactly, but when I convert it through my bash shell I am getting segfaults in pyopenms. When I convert it through the UI (proteowizard) I am not usually getting this error, until today.

You could try to use the FileConverter from mzML to mzML after conversion via shell/proteowizard. That is not super convenient, but it might correct issues with the proteowizard files.

eeko-kon commented 3 years ago

Really?? So weird. I will give it a shot. I suspect that if you don't feed it specific parameters (msconvert Leupeptin.raw --mzML --centroid or something) it just gives you inconsistent results. Not sure. I will look into it tomorrow. Thank you for today and have a good night!

oliveralka commented 3 years ago

No Problem! Have a good night!

PS: You have to be sure what kind of spectra are centroided in the conversion process. If you have already measured the MS2 spectra in centroid mode and are centroiding them again in the conversion process this may lead to corrupt MS2 spectra.

eeko-kon commented 3 years ago

Yes, I figured it out when I started building the workflow. Btw, it didn't run. I think it's stuck. I will take a look at it tomorrow step by step and see where the problem is.

oliveralka commented 3 years ago

Ok, it might be worth checking whether the default parameters are set for the algorithms, or setting them explicitly. Let me know how it goes!

Edit: It seems to work without any issue on the c++ side with

-executable /Users/alka/Documents/work/software/sirius-osx64-4.0.1/bin/sirius
-in /Users/alka/Desktop/tests_and_issues/DTU_Efi/siriusadapter_test/GermicidinAstandard.mzML
-in_featureinfo /Users/alka/Desktop/tests_and_issues/DTU_Efi/siriusadapter_test/devoncoluted_GermicidinAstandard.featureXML
-out_ms /Users/alka/Desktop/tests_and_issues/DTU_Efi/siriusadapter_test/GermicidinAstandard_out_sirius.ms
-converter_mode
-preprocessing:filter_by_num_masstraces  3
-preprocessing:feature_only

GermicidinAstandard_out_sirius.ms.zip

eeko-kon commented 3 years ago

I see. So this could be a python-wrapper issue or a parameter problem?

I opened the file (GermicidinAstandard_out_sirius.ms.zip) and unfortunately the correct mass is actually missing. So I definitely need to play around with the parameters. Germicidin A is 196.109945 (neutral mass). I can find the M+H exact mass in the FeatureFindingMetabo.featureXML file (197.1185), feature id="f_14619441151854324250". But that's it. I'm looking into the deconvoluted.featureXML and it actually seems identical to the FeatureFindingMetabo.featureXML.

eeko-kon commented 3 years ago

Ok, I have a theory after a day of trying to understand what is going on:

The files are mostly ok - the ones I am converting using the MSConvert GUI.

However, the MetaboliteFeatureDeconvolution step is problematic. It basically generates the exact same file as FeatureFindingMetabo(), or it gets completely stuck when I try a file larger than Germicidin A (e.g. Leupeptin).

At the same time, I tried the workflow in TOPPAS (using the Germicidin A file either raw or converted in TOPPAS) and again, it crashes at the MetaboliteAdductDecharger step!

15:19:45 ERROR: MetaboliteAdductDecharger crashed!

So strange!

oliveralka commented 3 years ago

What adducts are set in the parameters, when you run the MetaboliteAdductDecharger? It might be that the search space gets too big.

I would suggest, that you optimize the parameters for the feature detection step (FFM) and then try to run MAD again.

The difference between the FFM featureXML and the MAD featureXML should be that adducts are annotated.

Could it be that you store the wrong FeatureMap by mistake?

deconv = MetaboliteFeatureDeconvolution()
f_out = FeatureMap()
cons_map0 = ConsensusMap()
cons_map1 = ConsensusMap()
deconvoluted = deconv.compute(feature_map, f_out, cons_map0, cons_map1)
deconvol = FeatureXMLFile()
deconvol.store("./wf_testing/devoncoluted.featureXML", feature_map) 

Should the last step be: deconvol.store("./wf_testing/devoncoluted.featureXML", f_out)?

eeko-kon commented 3 years ago

I had deconvol.store("./wf_testing/devoncoluted.featureXML", feature_map)

and switched it to deconvol.store("./wf_testing/devoncoluted.featureXML", f_out)

Just now to check the differences! I will change it back!

And I will take a look at the parameters. That could be the problem!

oliveralka commented 3 years ago

ff = FeatureFindingMetabo()
ff.run(mass_traces_split,
    feature_map,
    mass_traces_filtered)

feature_map is used in the FFM to store the information.

You use that information again, and save the new information in f_out.

deconvoluted = deconv.compute(feature_map, f_out, cons_map0, cons_map1)

This means: f_out should be the correct one, could you compare feature_map and your f_out?

The one filled in the algorithm should have data about the adduct(s).

You can check that easily by using diff file1 file2 in the terminal.

Try to go over the code step by step. In some cases it might help to rename the variables in a way that shows where they come from,

e.g. feature_map_ffm; feature_map_dec

edit: Another option would be to run the tools via the command line or KNIME and then try to reproduce the result with the Python script. Then you see what the output should look like in the first place, and if it runs in KNIME/on the command line it should also run in pyopenms, unless there is a wrapping error.
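Besides diff, the comparison of the two featureXML files can be scripted. The sketch below counts UserParam entries named `dc_charge_adducts`, on the assumption that the deconvolution step writes this entry for annotated features; `count_adduct_annotations` is a hypothetical helper and the sample XML is a made-up, heavily shortened stand-in for a real featureXML file.

```python
import io
import xml.etree.ElementTree as ET


def count_adduct_annotations(source):
    """Count UserParam elements named 'dc_charge_adducts'.

    source may be a file path or a binary file object.
    """
    count = 0
    # iterparse streams the document, so large files need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag.endswith("UserParam") and elem.get("name") == "dc_charge_adducts":
            count += 1
        elem.clear()  # free the element once it has been inspected
    return count


SAMPLE = b"""<featureMap>
  <featureList count="2">
    <feature id="f_1">
      <UserParam type="string" name="dc_charge_adducts" value="H1"/>
      <UserParam type="stringList" name="adducts" value="[[M+H]+]"/>
    </feature>
    <feature id="f_2">
      <UserParam type="string" name="label" value="T732.2"/>
    </feature>
  </featureList>
</featureMap>"""

print(count_adduct_annotations(io.BytesIO(SAMPLE)))
```

Run on the feature-finding output and on the deconvoluted output, a count of zero for both would suggest that the same map was stored twice.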

eeko-kon commented 3 years ago

bash output up to deconvolution:

Generating Masses with threshold: -8.9872 ...
done
6705 of 17271 valid net charge compomer results did not pass the feature charge constraints
Inferring edges raised edge count from 14308 to 36066
Found 36066 putative edges (of 180481) and avg hit-size of 0.716575
Using solver 'coinor' ...
Optimal solution found!
<Using solver 'coinor' ...> occurred 50 times
 Branch and cut took 40.9798 seconds,  with objective value: 0.511625.
<Optimal solution found!> occurred 50 times
ILP score is: 0.511625
Agreeing charges: 367/2922
ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!
<ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!> occurred 1081 times
..
Warning: a significant portion of your decharged molecules have gapped, even-numbered charge ladders (42 of 422)This might indicate a too low charge interval being tested.
<..> occurred 2 times

diff between the files:

value="[851.384000000000015]"/>
>           <UserParam type="floatList" name="masstrace_centroid_mz" value="[142.119051803574308]"/>
> 37980a47822,47828
>           <UserParam type="int" name="map_idx" value="0"/>
>           <UserParam type="string" name="dc_charge_adducts" value="H1"/>
>           <UserParam type="stringList" name="adducts" value="[[M+H]+]"/>
>           <UserParam type="float" name="dc_charge_adduct_mass" value="1.0078250319"/>
>           <UserParam type="int" name="is_backbone" value="1"/>
>           <UserParam type="int" name="old_charge" value="0"/>
>           <UserParam type="string" name="Group" value="3769097129665771319"/>
> 37982,37985c47830,47833
<       <feature id="f_18258132156436287456">
<           <position dim="0">697.333999999999946</position>
<           <position dim="1">774.59791474306428</position>
<           <intensity>1.285593e04</intensity>
---
>       <feature id="f_4885285558678229880">
>           <position dim="0">854.868000000000052</position>
>           <position dim="1">118.087559968126442</position>
>           <intensity>3.775919e04</intensity>
> 37988c47836
<           <overallquality>3.980958e-05</overallquality>
---
>           <overallquality>1.169248e-04</overallquality>
> 37990,37991c47838,47839
<           <UserParam type="string" name="label" value="T732.2"/>
<           <UserParam type="float" name="FWHM" value="3.439255952835083"/>
---
>           <UserParam type="string" name="label" value="T961.14"/>
>           <UserParam type="float" name="FWHM" value="13.10629940032959"/>
> 37993,37995c47841,47843
<           <UserParam type="floatList" name="masstrace_intensity" value="[1.285592999999993e04]"/>
<           <UserParam type="floatList" name="masstrace_centroid_rt" value="[697.333999999999946]"/>
<           <UserParam type="floatList" name="masstrace_centroid_mz" value="[774.59791474306428]"/>
---
>           <UserParam type="floatList" name="masstrace_intensity" value="[3.775918800000008e04]"/>
>           <UserParam type="floatList" name="masstrace_centroid_rt" value="[854.868000000000052]"/>
>           <UserParam type="floatList" name="masstrace_centroid_mz" value="[118.087559968126442]"/>
> 37997a47846,47847
>           <UserParam type="string" name="Group" value="18297541040978580586"/>
>           <UserParam type="int" name="is_ungrouped_monoisotopic" value="1"/>
37999,38002c47849,47852
<       <feature id="f_8593042559675873728">
<           <position dim="0">704.301000000000045</position>
<           <position dim="1">776.236362935943248</position>
<           <intensity>4.351943e04</intensity>

--- (and a lot more lines)

oliveralka commented 3 years ago

>           <UserParam type="string" name="dc_charge_adducts" value="H1"/>
>           <UserParam type="stringList" name="adducts" value="[[M+H]+]"/>

This should be seen after deconvolution. It basically annotated an adduct for this feature. Which command-line tool are you using?

eeko-kon commented 3 years ago

diff FeatureFindingMetabo.featureXML devoncoluted.featureXML -y

If I look into the file, with ctrl F , I can detect what you wrote to me. So that's good

eeko-kon commented 3 years ago

> edit: Another option would be to run the tools via the command line or KNIME and then try to reproduce the result with the Python script. Then you see what the output should look like in the first place, and if it runs in KNIME/on the command line it should also run in pyopenms, unless there is a wrapping error.

I will try that :)

eeko-kon commented 3 years ago

UPDATES:

Sirius works great through KNIME. I am now converting all the files to centroid data using the command line:

msconvert --zlib --filter "peakPicking true [1 ,2]" --ignoreUnknownInstrumentError and the results are consistent. However, I double-checked my whole script and it looks OK to me, except for 2 variables.

This is my train of thought: I am always having the following output:

Generating Masses with threshold: -8.9872 ...
done
1217 of 3702 valid net charge compomer results did not pass the feature charge constraints
Inferring edges raised edge count from 3418 to 8518
Found 8518 putative edges (of 20230) and avg hit-size of 0.674454
Using solver 'coinor' ...
Optimal solution found!
<Using solver 'coinor' ...> occurred 16 times
 Branch and cut took 2.76753 seconds,  with objective value: 1.21756.
<Optimal solution found!> occurred 16 times
ILP score is: 1.21756
Agreeing charges: 97/776
ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!
preprocessed
.
<ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!> occurred 477 times
..
Warning: a significant portion of your decharged molecules have gapped, even-numbered charge ladders (12 of 117)This might indicate a too low charge interval being tested.
.
<..> occurred 2 times
Number of features to be processed: 80
Number of additional MS2 spectra to be processed: 4169
checked
Segmentation fault: 11

As far as I understand, a segfault will occur when there's a memory issue or when something is divided by zero (or, more generally, when the code finds zero where it doesn't expect it). I can see in the output: <ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!> occurred 477 times

Which could be the problem. I searched what ConsensusFeature refers to and it is possibly linked to the preprocessing step and the feature_mapping construction.

KDTreeFeatureMaps fp_map_kd; // reference to *basefeature in vector<FeatureMap>
FeatureMapping::FeatureToMs2Indices feature_mapping; // reference to *basefeature in vector<FeatureMap>

featureinfo= "./wf_testing/devoncoluted.featureXML"
spectra= exp
v_fp= []
fp_map_kd= KDTreeFeatureMaps()
sirius_algo= SiriusAdapterAlgorithm()
feature_mapping = FeatureMapping_FeatureToMs2Indices() 
sirius_algo.preprocessingSirius(featureinfo,
                                spectra,
                                v_fp,
                                fp_map_kd,
                                sirius_algo,
                                feature_mapping)

I think that feature_mapping needs to be somehow linked to the *basefeature in v_fp vector? Does this make sense? I will look into it.

oliveralka commented 3 years ago

I think the ConsensusFeature in this case has nothing to do with the preprocessing.

The problem with the error message is that it is somehow delayed.

Generating Masses with threshold: -8.9872 ... <- MetaboliteAdductDecharger
done <- MetaboliteAdductDecharger
1217 of 3702 valid net charge compomer results did not pass the feature charge constraints <- MetaboliteAdductDecharger
Inferring edges raised edge count from 3418 to 8518 <- MetaboliteAdductDecharger
Found 8518 putative edges (of 20230) and avg hit-size of 0.674454 <- MetaboliteAdductDecharger
Using solver 'coinor' ... <- MetaboliteAdductDecharger
Optimal solution found! <- MetaboliteAdductDecharger
<Using solver 'coinor' ...> occurred 16 times <- MetaboliteAdductDecharger
 Branch and cut took 2.76753 seconds,  with objective value: 1.21756. <- MetaboliteAdductDecharger
<Optimal solution found!> occurred 16 times <- MetaboliteAdductDecharger
ILP score is: 1.21756 <- MetaboliteAdductDecharger
Agreeing charges: 97/776 <- MetaboliteAdductDecharger
ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0! <- MetaboliteAdductDecharger
preprocessed
. <- MetaboliteAdductDecharger
<ConsensusFeature::computeDechargeConsensus() WARNING: Feature's charge is 0! This will lead to M=0!> occurred 477 times <- MetaboliteAdductDecharger
.. <- MetaboliteAdductDecharger
Warning: a significant portion of your decharged molecules have gapped, even-numbered charge ladders (12 of 117)This might indicate a too low charge interval being tested. <- MetaboliteAdductDecharger
.<- MetaboliteAdductDecharger
<..> occurred 2 times <- MetaboliteAdductDecharger
Number of features to be processed: 80 <- SiriusAdapter
Number of additional MS2 spectra to be processed: 4169 <- SiriusAdapter
checked
Segmentation fault: 11  <- SiriusAdapter

For example, the statement below is already produced by the checkFeatureSpectraNumber https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/SiriusAdapterAlgorithm.cpp#L296

Number of features to be processed: 80 
Number of additional MS2 spectra to be processed: 4169 

This means that the mapping has worked. I think I would take another look at the error with the traceback. Could you please let me know what the traceback looks like? Unfortunately, it is pretty hard to debug the Python/C++ interface.

Edit: You could also try to reduce the complexity of the dataset for SIRIUS, to see if that has an impact, by setting the parameter "preprocessing:filter_by_num_masstraces" to 3 (for example).
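Because the library's messages can show up later than the Python prints around them, it may help to write explicit, immediately flushed markers between the workflow steps. This is a sketch with a hypothetical `checkpoint` helper; it does not change when the C++ side emits its own log lines, it only makes sure the Python markers are written out before a possible crash.

```python
import sys
import time


def checkpoint(label, stream=None):
    """Write a timestamped marker and flush it immediately."""
    stream = stream if stream is not None else sys.stderr
    stream.write(f"[{time.strftime('%H:%M:%S')}] {label}\n")
    stream.flush()


checkpoint("before preprocessing")
# ... call the preprocessing step here ...
checkpoint("after preprocessing")
```

The last marker that appears before the segmentation fault then narrows down which call did not return.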

eeko-kon commented 3 years ago

Ok I see. Yes I am starting to play around with the parameters now, let's see how this works out :)

oliveralka commented 3 years ago

Please let me know the Traceback of the error you are getting, then I will try to debug it (probably beginning of next week).

eeko-kon commented 3 years ago

Trying to figure it out. I would normally get a traceback in the output, but I am not getting anything, even when calling traceback.print_tb(tb, limit=None, file=None) after import traceback.
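A segmentation fault ends the interpreter before a normal Python exception is raised, so the traceback module has nothing to print. Python's standard library `faulthandler` module is meant for this situation: once enabled, it dumps the Python-level traceback when the process receives a fatal signal such as SIGSEGV. A minimal sketch (the wrapper function `enable_crash_traceback` is a hypothetical name):

```python
import faulthandler


def enable_crash_traceback():
    """Enable faulthandler; return True on success, False otherwise."""
    try:
        # By default the traceback is written to sys.stderr.
        faulthandler.enable(all_threads=True)
    except (AttributeError, ValueError, OSError, RuntimeError):
        # Happens when stderr has no real file descriptor (e.g. when captured).
        return False
    return faulthandler.is_enabled()


print(enable_crash_traceback())
# ... run the workflow steps after this point ...
```

The same effect is available without code changes by starting the script with `python -X faulthandler`. The dump only shows which Python line was executing, not the C++ call stack, but that already identifies the wrapped call that crashed.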

eeko-kon commented 3 years ago

Not getting any traceback Oliver. It's just the segfault and it doesn't allow me to trace the error. If you are talking about the ConsensusFeature, this is a warning, not a traceback. Does that answer your Traceback request? Or am I way off? :D

oliveralka commented 3 years ago

No worries, I am not talking about the ConsensusFeature, but the segfault - as you have guessed correctly.

It is really strange that you are getting a segfault 11, which points to a memory problem (an invalid memory access, possibly caused by running out of memory), while it works without any issues in KNIME.

I think you should work on the parameters and I will try to figure out what goes wrong at the interface level.

For me the segfault looks as follows:

Number of features to be processed: 169
Number of additional MS2 spectra to be processed: 506
checked
[1]    88359 segmentation fault  /usr/local/miniconda3/envs/build_pyopenms_39/bin/python 

eeko-kon commented 3 years ago

Yes, I will definitely work on the parameters.

For me it's just : Segmentation fault: 11

Nothing else..

oliveralka commented 3 years ago

Can you check the available memory on the machine and the size of the /tmp directory? Can you also check if there are a lot of .ms files in the /tmp?
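These checks can be scripted with the standard library. The sketch below reports the free space in the temporary directory, counts the .ms files below it, and reads `MemAvailable` from /proc/meminfo where that file exists (Linux only; on macOS it returns None). The function names are hypothetical, and the assumption that leftover files sit in the system temporary directory is mine.

```python
import glob
import os
import shutil
import tempfile


def tmp_report(tmp_dir=None):
    """Summarise free space and the number of .ms files below tmp_dir."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    usage = shutil.disk_usage(tmp_dir)
    ms_files = glob.glob(os.path.join(tmp_dir, "**", "*.ms"), recursive=True)
    return {
        "tmp_dir": tmp_dir,
        "free_gb": round(usage.free / 1e9, 2),
        "ms_files": len(ms_files),
    }


def mem_available_kb(meminfo_path="/proc/meminfo"):
    """Return MemAvailable in kB, or None if it cannot be read."""
    try:
        with open(meminfo_path) as handle:
            for line in handle:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


if __name__ == "__main__":
    print(tmp_report())
    print(mem_available_kb())
```

Calling `tmp_report()` without an argument inspects the system temporary directory; a specific directory can be passed in instead.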

eeko-kon commented 3 years ago

The memory shouldn't be an issue, because I ran the workflow on our shared machine last week when this error first appeared and I got the exact same issue. The memory on my Mac is only 8 GB and 6.56 GB are being used. I also would think this is the issue. Let me try again on the shared machine.

eeko-kon commented 3 years ago

PID    USER  PR  NI  VIRT     RES     SHR    S  %CPU   %MEM  TIME+    COMMAND
28962  eeko  20  0   2157688  184524  57132  R  329.7  0.3   0:09.99  python

This is the peak memory usage on the shared machine, which would normally have:

MemTotal:       65882684 kB
MemFree:        14907788 kB
MemAvailable:   49163088 kB

I ran it 4 times, and in 1 out of 4 runs I got these error lines before the segmentation fault:

native_id: scanId=1106230 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=1106330 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=1106429 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=61098 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=61197 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=61297 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=63101 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=63201 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
native_id: scanId=63300 accession: MS:1001508 Could not extract scan number - no valid native_id_type_accession was provided
Segmentation fault (core dumped)

oliveralka commented 3 years ago

Hm, ok, which machine are you using and what kind of filetype are the raw files?

eeko-kon commented 3 years ago

It's a server that we have in the group with 62 GiB of system memory. Processor: Intel(R) Xeon(R) CPU E5-1660 v4 @ 3.20GHz

eeko-kon commented 3 years ago

The raw files are initially .raw (from an Orbitrap fusion ID-X) and converted to mzML (centroid data). Is that what you mean?

oliveralka commented 3 years ago

That looks pretty fishy. Which example file are you using right now?

GermicidinAstandard.mzML:

print("Loading")
MzMLFile().load("Standards/GermicidinAstandard.mzML", exp)
print("Loaded")

print(exp.getSourceFiles()[0].getNativeIDTypeAccession())
print(exp.getSourceFiles()[0].getNativeIDType())
MS:1000772
Bruker BAF nativeID format

eeko-kon commented 3 years ago

Oh my bad. That one was from bruker :D
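The vendor check shown above with pyopenms can also be done on the mzML header alone, which is handy for screening several files before running the workflow. This is a sketch with the standard library only; `source_file_native_id_formats` is a hypothetical helper, the sample header is made up and heavily shortened, and the rule that the relevant cvParam names end in "nativeID format" is an assumption.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def source_file_native_id_formats(source):
    """Return (accession, name) pairs of nativeID cvParams of the source files."""
    found = []
    inside_source_file = False
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if name == "sourceFile":
            inside_source_file = event == "start"
        elif event == "end" and name == "cvParam" and inside_source_file:
            if elem.get("name", "").endswith("nativeID format"):
                found.append((elem.get("accession"), elem.get("name")))
        elif event == "end" and name == "fileDescription":
            break  # the header is done; no need to read the spectra
    return found


SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <fileDescription>
    <sourceFileList count="1">
      <sourceFile id="SF1" name="example.d" location="file:///example">
        <cvParam cvRef="MS" accession="MS:1000772" name="Bruker BAF nativeID format" value=""/>
      </sourceFile>
    </sourceFileList>
  </fileDescription>
</mzML>"""

print(source_file_native_id_formats(io.BytesIO(SAMPLE)))
```

Stopping at the end of fileDescription keeps the check fast, since the spectra that make up most of an mzML file are never parsed.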