Mesnage-Org / pgfinder

Peptidoglycan MS1 Analysis Tool
https://mesnage-org.github.io/pgfinder
GNU Lesser General Public License v3.0

Improve column names in output files #144

Closed ns-rse closed 1 year ago

ns-rse commented 1 year ago

Currently the column names in output CSV files are...

ID
xicStart
xicEnd
ionCount
chargeOrder
rt
mwMonoisotopic
theo_mwMonoisotopic
diff_ppm
inferredStructure
maxIntensity

These could be improved.

smesnage commented 1 year ago

- Should the output be sorted in any manner, perhaps by the ID column? I don't think it would help. What matters is to lump all the matches together at the top and the rest after. The example you sent is fine.

- There appear to be a number of rows included where no match has been found; would it be easier/more convenient if they were excluded? No. The masses which cannot be matched to a searched structure are the interesting ones, because they are the so-called "peptidoglycan dark matter" (they may or may not be novel PG structures).
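
The ordering described in these two points (matched rows lumped together at the top, unmatched rows kept and placed after them) could be sketched as follows. This is plain Python on dictionaries rather than pgfinder's own code, and it assumes that an unmatched row has an empty or missing `inferredStructure` value; the structure strings in the example are placeholders only.

```python
def matches_first(rows):
    """Return rows with matched masses first and unmatched masses after.

    Assumption: a row counts as unmatched when inferredStructure is empty or None.
    sorted() is stable, so the original order is preserved within each group.
    """
    # False sorts before True, so rows that have a structure come first.
    return sorted(rows, key=lambda row: not row.get("inferredStructure"))


rows = [
    {"ID": 1, "inferredStructure": ""},
    {"ID": 2, "inferredStructure": "structure-A"},  # placeholder value
    {"ID": 3, "inferredStructure": None},
    {"ID": 4, "inferredStructure": "structure-B"},  # placeholder value
]
ordered = matches_first(rows)
print([row["ID"] for row in ordered])
```

No rows are dropped, so the unmatched "dark matter" masses remain in the output.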

Here's the list of the revised names (in the order they should appear; top is left, bottom is right):

ID
Ion count
Charge state
XIC start (min)
XIC end (min)
RT (min)
Obs (Da)
Theo (Da)
∆ppm
Inferred structure
Intensity

Format is also a bit of an issue. Modifications would be welcome to avoid manual operations:

XIC start: 2 decimals only
XIC end: 2 decimals only
RT: 2 decimals only
∆ppm: 1 decimal only
Intensity: scientific format
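
A minimal sketch of the requested number formats, using only Python's built-in `format()`. The `FORMATS` table and `format_row()` helper are hypothetical names, not part of pgfinder, and the number of mantissa digits for the scientific format (two here) is an assumption because the request does not specify it.

```python
# Format specifications for the revised column names requested above.
FORMATS = {
    "XIC start (min)": ".2f",
    "XIC end (min)": ".2f",
    "RT (min)": ".2f",
    "∆ppm": ".1f",
    "Intensity": ".2e",  # assumption: two decimals in the mantissa
}


def format_row(row):
    """Return a copy of row with the listed columns rendered as formatted strings."""
    formatted = dict(row)
    for column, spec in FORMATS.items():
        value = formatted.get(column)
        # Leave missing or empty values untouched.
        if value is not None and value != "":
            formatted[column] = format(float(value), spec)
    return formatted


example = {
    "ID": 7,
    "XIC start (min)": 5.12345,
    "XIC end (min)": 5.5,
    "RT (min)": 5.301,
    "∆ppm": 1.26,
    "Intensity": 1234567,
}
print(format_row(example))
```

Formatting to strings only at write time keeps the full-precision values available for any earlier calculations.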

ns-rse commented 1 year ago

Where do column/field/variable names come from

Many of the column names originate from the ftrs file and/or database at data/first_test_data.ftrs. This sqlite3 database has the following table headers...

ChargeClusters

sqlite> PRAGMA table_info(ChargeClusters);
0|Id|INTEGER|1||1
1|scanIndex|INT|1||0
2|vendorScanNumber|INT|1||0
3|retentionTimeMinutes|REAL|1||0
4|mzFound|REAL|1||0
5|intensity|INT|1||0
6|mwMonoisotopic|REAL|1||0
7|monoOffset|INT|1||0
8|averagineCorrelation|REAL|1||0
9|charge|INT|1||0
10|isotopeCount|INT|1||0
11|scanNoiseFloor|REAL|1||0
12|driftChannel|INT|0||0
13|mobilityScanGroup|INT|0||0
14|mobilityValue|REAL|0||0

FeatureMobilities

sqlite> PRAGMA table_info(FeatureMobilities);
0|Id|INTEGER|1||1
1|feature|INT|1||0
2|charge|INT|1||0
3|mobilityValueStart|REAL|1||0
4|mobilityValueEnd|REAL|1||0

FeatureFinderSettings

sqlite> PRAGMA table_info(FeatureFinderSettings);
0|Id|INTEGER|1||1
1|parameter|TEXT|1||0
2|value|TEXT|1||0

Features

sqlite> PRAGMA table_info(Features);
0|Id|INTEGER|1||1
1|xicStart|REAL|1||0
2|xicEnd|REAL|1||0
3|apexRetentionTimeMinutes|REAL|1||0
4|feature|INT|1||0
5|apexMwMonoisotopic|REAL|1||0
6|maxAveragineCorrelation|REAL|1||0
7|maxIntensity|INT|1||0
8|ionCount|INT|1||0
9|chargeOrder|TEXT|1||0
10|maxIsotopeCount|INT|1||0
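
The same table information can be collected in one pass from Python with the standard `sqlite3` module. This is only a sketch: `table_columns()` is a hypothetical helper, and the example runs against a small in-memory stand-in database rather than the real `.ftrs` file.

```python
import sqlite3


def table_columns(connection):
    """Map each table name to a list of (column name, declared type) pairs."""
    tables = [
        name
        for (name,) in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    columns = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        info = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns[table] = [(row[1], row[2]) for row in info]
    return columns


# In-memory stand-in so the sketch runs without the .ftrs file.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Features (Id INTEGER PRIMARY KEY NOT NULL, "
    "xicStart REAL NOT NULL, xicEnd REAL NOT NULL)"
)
print(table_columns(connection))
```

To inspect the real file, the connection would instead be opened with `sqlite3.connect("data/first_test_data.ftrs")`.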

Input files also have headers which may be the source of variable/column names. The example maxquant_test_data.txt has the following headers...

❱ head tmp/maxquant_test_data.txt -n1 | sed 's/\t/\n/g'
Raw file
Type
Charge
m/z
Mass
Uncalibrated m/z
Resolution
Number of data points
Number of scans
Number of isotopic peaks
PIF
Mass fractional part
Mass deficit
Mass precision [ppm]
Max intensity m/z 0
Retention time
Retention length
Retention length (FWHM)
Min scan number
Max scan number
Identified
MS/MS IDs
Sequence
Length
Modifications
Modified sequence
Proteins
Score
Intensity
Intensities
Isotope pattern
MS/MS Count
MSMS Scan Numbers
MSMS Isotope Indices
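
For reference, the equivalent of the `head | sed` command in Python, as a sketch using the standard `csv` module. `read_header()` is a hypothetical helper, and the sample data is just the first few header names from the list above.

```python
import csv
import io


def read_header(handle, delimiter="\t"):
    """Return the column names from the first line of a delimited text file."""
    return next(csv.reader(handle, delimiter=delimiter))


# Stand-in for an open file so the sketch runs without the test data.
sample = io.StringIO("Raw file\tType\tCharge\tm/z\tMass\n")
for name in read_header(sample):
    print(name)
```

With the real file this would be called as `read_header(open("tmp/maxquant_test_data.txt", newline=""))`.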

Looking through the code, pgio.ftrs_reader() appears to pick out most of these fields, so they stem either from the Features database table (see above) or, in some cases, from a file input/uploaded by users. Both the defaults and the input files will therefore need changing.

Mapping columns

Mapping columns (order is also indicated by the rows in the table below)...

| Current | New | Source |
|---|---|---|
| ID | ID | Input / Features table |
| ionCount | Ion count | Input / Features table |
| chargeOrder | Charge state | Input / Features table |
| xicStart | XIC start (min) | Input / Features table |
| xicEnd | XIC end (min) | Input / Features table |
| rt | RT (min) | Input (Retention time) |
| mwMonoisotopic | Obs (Da) | Input / Features table |
| theo_mwMonoisotopic | Theo (Da) | Derived (pgfinder.pgio.ftrs_reader()) |
| diff_ppm | ∆ppm | Derived (pgfinder.matching.calculate_ppm_delta()) |
| inferredStructure | Inferred structure | Derived |
| maxIntensity | Intensity | Input / Features table |
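
The mapping could be held in a single ordered dictionary that drives both the renaming and the column order when writing the CSV. This is a sketch with the standard `csv` module, not the pgfinder implementation; `COLUMN_MAP`, `rename_and_reorder()` and `write_csv()` are hypothetical names.

```python
import csv
import io

# Current name -> new name, listed in the requested output order.
COLUMN_MAP = {
    "ID": "ID",
    "ionCount": "Ion count",
    "chargeOrder": "Charge state",
    "xicStart": "XIC start (min)",
    "xicEnd": "XIC end (min)",
    "rt": "RT (min)",
    "mwMonoisotopic": "Obs (Da)",
    "theo_mwMonoisotopic": "Theo (Da)",
    "diff_ppm": "∆ppm",
    "inferredStructure": "Inferred structure",
    "maxIntensity": "Intensity",
}


def rename_and_reorder(rows):
    """Rename each row's keys and put them in COLUMN_MAP order (missing values become '')."""
    return [{new: row.get(old, "") for old, new in COLUMN_MAP.items()} for row in rows]


def write_csv(rows, handle):
    """Write rows to handle with the new header names in the requested order."""
    writer = csv.DictWriter(handle, fieldnames=list(COLUMN_MAP.values()))
    writer.writeheader()
    writer.writerows(rename_and_reorder(rows))


# Illustrative row with made-up values; keys are deliberately out of order.
buffer = io.StringIO()
write_csv([{"rt": 5.3, "ID": 1, "maxIntensity": 1200, "ionCount": 4}], buffer)
print(buffer.getvalue())
```

Keeping names and order in one place means a later change to either only touches `COLUMN_MAP`.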

Order

The order of columns is defined in pgfinder/pgio.ftrs_reader() (line 109)

smesnage commented 1 year ago

Does the comment above await a response? Sounds like you've worked it out??

ns-rse commented 1 year ago

> Does the comment above await a response? Sounds like you've worked it out??

Sorry, no response required, I was just making notes for when I get round to making the changes. Looking at this again now.