enpkg / enpkg_full

The full enpkg workflow
GNU General Public License v3.0

Extract metadata directly from the .mzML #6

Open oolonek opened 1 month ago

oolonek commented 1 month ago

We are thinking of transitioning from the "sample-centric" to an "analysis-centric" approach. This means that there should no longer be /pos and /neg subdirectories within the sample dir, but rather one unique dir per analysis.

Using matchms or pyteomics, we would like to extract the following metadata from a .mzML file:

<?xml version="1.0" encoding="utf-8"?>
<indexedmzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.2_idx.xsd">
  <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" id="20240307_EB_dbgi_001199_01_01" version="1.1.0">
    <cvList count="2">
      <cv id="MS" fullName="Proteomics Standards Initiative Mass Spectrometry Ontology" version="4.1.56" URI="https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo"/>
      <cv id="UO" fullName="Unit Ontology" version="09:04:2014" URI="https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/master/unit.obo"/>
    </cvList>
    <fileDescription>
      <fileContent>
        <cvParam cvRef="MS" accession="MS:1000579" name="MS1 spectrum" value=""/>
        <cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/>
      </fileContent>
      <sourceFileList count="1">
        <sourceFile id="RAW1" name="20240307_EB_dbgi_001199_01_01.raw" location="file:///Y:\public\QE_plus_unifr\raw\2024\03">
          <cvParam cvRef="MS" accession="MS:1000768" name="Thermo nativeID format" value=""/>
          <cvParam cvRef="MS" accession="MS:1000563" name="Thermo RAW format" value=""/>
          <cvParam cvRef="MS" accession="MS:1000569" name="SHA-1" value="505548b2eb00b0490e332020e889f01868f0ab37"/>
        </sourceFile>
      </sourceFileList>
    </fileDescription>
    <referenceableParamGroupList count="1">
      <referenceableParamGroup id="CommonInstrumentParams">
        <cvParam cvRef="MS" accession="MS:1002634" name="Q Exactive Plus" value=""/>
        <cvParam cvRef="MS" accession="MS:1000529" name="instrument serial number" value="Exactive Series slot #1"/>
      </referenceableParamGroup>
    </referenceableParamGroupList>
    <softwareList count="2">
      <software id="Xcalibur" version="2.9-290033/2.9.0.2926">
        <cvParam cvRef="MS" accession="MS:1000532" name="Xcalibur" value=""/>
      </software>
      <software id="pwiz" version="3.0.22105">
        <cvParam cvRef="MS" accession="MS:1000615" name="ProteoWizard software" value=""/>
      </software>
    </softwareList>
    <instrumentConfigurationList count="1">
      <instrumentConfiguration id="IC1">
        <referenceableParamGroupRef ref="CommonInstrumentParams"/>
        <componentList count="4">
          <source order="1">
            <cvParam cvRef="MS" accession="MS:1000073" name="electrospray ionization" value=""/>
            <cvParam cvRef="MS" accession="MS:1000057" name="electrospray inlet" value=""/>
          </source>
          <analyzer order="2">
            <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole" value=""/>
          </analyzer>
          <analyzer order="3">
            <cvParam cvRef="MS" accession="MS:1000484" name="orbitrap" value=""/>
          </analyzer>
          <detector order="4">
            <cvParam cvRef="MS" accession="MS:1000624" name="inductive detector" value=""/>
          </detector>
        </componentList>
        <softwareRef ref="Xcalibur"/>
      </instrumentConfiguration>
    </instrumentConfigurationList>
    <dataProcessingList count="1">
      <dataProcessing id="pwiz_Reader_Thermo_conversion">
        <processingMethod order="0" softwareRef="pwiz">
          <cvParam cvRef="MS" accession="MS:1000544" name="Conversion to mzML" value=""/>
        </processingMethod>
      </dataProcessing>
    </dataProcessingList>

software

For each value, when possible, we retrieve the associated MS: ontology identifiers.
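To make the idea concrete, here is a minimal sketch of how the software and instrument terms could be pulled from the header together with their MS: accessions. It uses only the standard library rather than matchms or pyteomics, so it is a stand-in and not the final implementation. The names `extract_header_metadata` and `_cv_terms` and the trimmed `SAMPLE` document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"  # default namespace of mzML 1.1 files


def _cv_terms(elem):
    """Return accession, name and value of every cvParam below `elem`."""
    return [
        {"accession": cv.get("accession"), "name": cv.get("name"), "value": cv.get("value")}
        for cv in elem.iter(NS + "cvParam")
    ]


def extract_header_metadata(source):
    """Collect software and instrument terms from an mzML header.

    `source` is a path or a binary file object (hypothetical helper).
    """
    meta = {"software": [], "instrument": []}
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # The header ends where <run> begins, so stop before the spectra.
            if elem.tag == NS + "run":
                break
            continue
        if elem.tag == NS + "software":
            meta["software"].append(
                {"id": elem.get("id"), "version": elem.get("version"), "terms": _cv_terms(elem)}
            )
        elif elem.tag in (NS + "referenceableParamGroup", NS + "instrumentConfiguration"):
            meta["instrument"].extend(_cv_terms(elem))
    return meta


# Trimmed, illustrative header modelled on the excerpt above (not a full file).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
  <referenceableParamGroupList count="1">
    <referenceableParamGroup id="CommonInstrumentParams">
      <cvParam cvRef="MS" accession="MS:1002634" name="Q Exactive Plus" value=""/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <softwareList count="1">
    <software id="pwiz" version="3.0.22105">
      <cvParam cvRef="MS" accession="MS:1000615" name="ProteoWizard software" value=""/>
    </software>
  </softwareList>
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <componentList count="1">
        <analyzer order="1">
          <cvParam cvRef="MS" accession="MS:1000484" name="orbitrap" value=""/>
        </analyzer>
      </componentList>
    </instrumentConfiguration>
  </instrumentConfigurationList>
  <run id="run1"/>
</mzML>
"""

meta = extract_header_metadata(io.BytesIO(SAMPLE))
```

Stopping at `<run>` keeps the read cheap on large files, since the spectra are never parsed.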

analysis related measure

  • [ ] file checksum

    a1d0c80df517b098fdae862d5722504019b89738
  • [x] Number of scans
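For the number of scans, one possible approach is to read the `count` attribute declared on `<spectrumList>` and cross-check it by counting the `<spectrum>` elements per MS level (`MS:1000511`, "ms level"). This is a sketch using the standard library only. `count_scans` and the tiny `SAMPLE` document are hypothetical, and it assumes the ms level cvParam is a direct child of each `<spectrum>`.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://psi.hupo.org/ms/mzml}"
MS_LEVEL = "MS:1000511"  # PSI-MS accession for "ms level"


def count_scans(source):
    """Return the declared and the counted number of spectra (hypothetical helper)."""
    declared = None
    by_level = Counter()
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == NS + "spectrumList":
                declared = int(elem.get("count"))
            continue
        if elem.tag == NS + "spectrum":
            level = None
            # Assumption: ms level is given directly on the spectrum element.
            for cv in elem.findall(NS + "cvParam"):
                if cv.get("accession") == MS_LEVEL:
                    level = int(cv.get("value"))
            by_level[level] += 1
            elem.clear()  # drop peak data we do not need
    return {"declared": declared, "counted": sum(by_level.values()), "by_ms_level": dict(by_level)}


# Illustrative document with two MS1 scans and one MS2 scan, no peak data.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
  <run id="run1">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
"""

scan_counts = count_scans(io.BytesIO(SAMPLE))
```

A mismatch between `declared` and `counted` would hint at a truncated or otherwise damaged file.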

oolonek commented 1 month ago

Looking at the .mzML, we observe a fileChecksum at the end of the file:

<fileChecksum>86c35a0d4acb405e1000492f54f9e0fa55cb353b</fileChecksum>

We might use this checksum to uniquely identify the analysis in the graph.

oolonek commented 1 month ago

In fact, it appears that there is another file checksum at the beginning:

      <sourceFileList count="1">
        <sourceFile id="RAW1" name="20240307_EB_dbgi_001195_01_01.raw" location="file:///Y:\public\QE_plus_unifr\raw\2024\03">
          <cvParam cvRef="MS" accession="MS:1000768" name="Thermo nativeID format" value=""/>
          <cvParam cvRef="MS" accession="MS:1000563" name="Thermo RAW format" value=""/>
          <cvParam cvRef="MS" accession="MS:1000569" name="SHA-1" value="9bb6474f5663680a302fdc55c65b620f422e9d61"/>

We should also extract this one into our metadata file.
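A rough sketch of how both checksums could be collected in one pass, again with the standard library only: the SHA-1 stored as cvParam `MS:1000569` under each `<sourceFile>`, and the `<fileChecksum>` at the end of the indexed file. `extract_checksums`, the result keys and the shortened `SAMPLE` are hypothetical. The hash values are the ones quoted in this thread.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"
SHA1 = "MS:1000569"  # PSI-MS accession for "SHA-1"


def extract_checksums(source):
    """Return source-file SHA-1 values and the mzML fileChecksum (hypothetical helper)."""
    result = {"source_files": [], "mzml_sha1": None}
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "sourceFile":
            for cv in elem.findall(NS + "cvParam"):
                if cv.get("accession") == SHA1:
                    result["source_files"].append(
                        {"name": elem.get("name"), "sha1": cv.get("value")}
                    )
        elif elem.tag == NS + "fileChecksum":
            result["mzml_sha1"] = (elem.text or "").strip()
        elif elem.tag in (NS + "spectrum", NS + "chromatogram"):
            elem.clear()  # peak data is not needed here
    return result


# Shortened, illustrative indexedmzML (no spectra, no index).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<indexedmzML xmlns="http://psi.hupo.org/ms/mzml">
  <mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
    <fileDescription>
      <sourceFileList count="1">
        <sourceFile id="RAW1" name="20240307_EB_dbgi_001195_01_01.raw" location="file:///tmp">
          <cvParam cvRef="MS" accession="MS:1000569" name="SHA-1" value="9bb6474f5663680a302fdc55c65b620f422e9d61"/>
        </sourceFile>
      </sourceFileList>
    </fileDescription>
  </mzML>
  <fileChecksum>86c35a0d4acb405e1000492f54f9e0fa55cb353b</fileChecksum>
</indexedmzML>
"""

checksums = extract_checksums(io.BytesIO(SAMPLE))
```

Since `<fileChecksum>` sits at the very end, this version reads the whole file. Reading only the tail of the file would be faster but is not shown here.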

oolonek commented 1 month ago

@edouardbruelhart So I just had a look and can confirm that:

  1. the first checksum (see line below) corresponds to a hash of the original Thermo RAW file. It is independent of the name given to the .raw file, which is great news.
    <cvParam cvRef="MS" accession="MS:1000569" name="SHA-1" value="9bb6474f5663680a302fdc55c65b620f422e9d61"/>
  2. the last one corresponds to a hash of the current .mzML. This one will change when the msconvert parameters are altered.
    <fileChecksum>86c35a0d4acb405e1000492f54f9e0fa55cb353b</fileChecksum>
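If the first checksum is indeed computed over the bytes of the RAW file, it should be possible to recompute it locally and compare it with the value recorded in the .mzML. The sketch below shows one way to do that with `hashlib`. `sha1_of_file` and `matches_recorded_sha1` are hypothetical helper names, and the assumption that msconvert hashes the complete file content in exactly this way would still need to be checked against a real .raw file.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file's content in chunks; the file name never enters the digest."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_sha1(path, recorded):
    """Compare a local file with the SHA-1 recorded in the .mzML.

    Assumption: the recorded value is a plain SHA-1 over the whole RAW file.
    """
    return sha1_of_file(path) == recorded.strip().lower()
```

Because only the content is hashed, two copies of the same acquisition saved under different names give the same digest, which fits the observation in point 1.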