OpenMS converter does not handle "cleavage agent" = "not applicable" correctly

davidecarlson commented 1 year ago

Hi All,

I'm trying to run parse_sdrf convert-openms on my SDRF file, but I'm getting an error that I don't fully understand, particularly since the validation tool doesn't indicate any obvious issues with my input file. Here are the commands I've run along with the output:

[~]$ parse_sdrf validate-sdrf --sdrf_file my_SDRF.tsv
Everything seems to be fine. Well done.

[~]$ parse_sdrf convert-openms -t2 -l -s my_SDRF.tsv
PROCESSING: my_SDRF.tsv"
Factor columns: ['factor value[phenotype]']
Characteristics columns (those covered by factor columns removed): ['characteristics[organism]', 'characteristics[organism part]', 'characteristics[cell type]', 'characteristics[disease]', 'characteristics[biological replicate]']
Error: 'NoneType' object has no attribute 'group'

Here are the first five lines of my input file. I can attach the full thing if it would be useful, but the issue arises even using only this portion of the input.

source name characteristics[organism]   characteristics[organism part]  characteristics[cell type]  characteristics[disease]    characteristics[phenotype]  characteristics[biological replicate]   assay name  comment[instrument] comment[cleavage agent details] comment[fraction identifier]    comment[label]  comment[technical replicate]    comment[data file]  factor value[phenotype]
Sample dE4  Francisella tularensis subsp. holarctica LVS    not available   not available   not available   pilE4_deletion  1   Run 1   Q Exactive HF   not applicable  1   label free sample   1   /gpfs/projects/GenomicsCore/proteomics/DeRosa/CLICK-E-F-LVS/Expt1/dE4_NLL1_20220107235651.raw   pilE4_deletion
Sample dE4  Francisella tularensis subsp. holarctica LVS    not available   not available   not available   pilE4_deletion  1   Run 2   Q Exactive HF   not applicable  1   label free sample   2   /gpfs/projects/GenomicsCore/proteomics/DeRosa/CLICK-E-F-LVS/Expt1/dE4_NLL1_20220108224853.raw   pilE4_deletion
Sample dE4  Francisella tularensis subsp. holarctica LVS    not available   not available   not available   pilE4_deletion  1   Run 3   Q Exactive HF   not applicable  1   label free sample   3   /gpfs/projects/GenomicsCore/proteomics/DeRosa/CLICK-E-F-LVS/Expt1/dE4_NLL1.raw  pilE4_deletion
Sample dE4  Francisella tularensis subsp. holarctica LVS    not available   not available   not available   pilE4_deletion  1   Run 4   Q Exactive HF   not applicable  1   label free sample   4   /gpfs/projects/GenomicsCore/proteomics/DeRosa/CLICK-E-F-LVS/Expt1/dE4_NLL2_20220108014816.raw   pilE4_deletion

Any ideas on what I'm doing wrong?

Thank you! Dave

davidecarlson commented 1 year ago

Note that I asked about this on the nf-core/quantms slack as well. But since the issue occurs even when run manually outside of the pipeline, I figured maybe I should ask here.

ypriverol commented 1 year ago

I will double-check and get back to you today. Can you provide the full SDRF?

davidecarlson commented 1 year ago

Thanks a lot! I'm attaching the full SDRF (file extension changed from tsv to txt for Github compatibility).

I'm very new to MS-based proteomics, so it's very possible I've introduced some sort of error in the SDRF file.

my_SDRF.txt

ypriverol commented 1 year ago

Before testing, can I ask why no PTMs are included in the SDRF?

davidecarlson commented 1 year ago

The only reason is that I'm processing data collected by someone else, and I was not given any information regarding post-translation modifications.

ypriverol commented 1 year ago

Normally, oxidation of methionine is allowed as a variable modification, and in almost 99% of cases Carbamidomethyl C is also allowed.

davidecarlson commented 1 year ago

Thanks. I will add those variables.

Do you think that is related to the error I'm getting, or is including them just more of a best practice?

Thanks! Dave

ypriverol commented 1 year ago

Both. It is good because I'm almost 100% sure you will need them, and it could be the source of the error. We don't have a good error message system in the released version of sdrf-pipelines, but we are working to improve it.

davidecarlson commented 1 year ago

Okay, thanks. I will add that and report back on the result.

davidecarlson commented 1 year ago

I've added two new protein modification columns:

comment[modification parameters]    comment[modification parameters]
NT=Oxidation;MT=Variable;TA=M;AC=UNIMOD:35    NT=Carbamidomethyl;AC=UNIMOD:4;TA=C;MT=Fixed

However, I'm still seeing the same error message when running parse_sdrf:

[~]$ parse_sdrf validate-sdrf --sdrf_file my_SDRF.tsv
Everything seems to be fine. Well done.
[~]$ parse_sdrf convert-openms -t2 -l -s my_SDRF.tsv
PROCESSING: DeRosa_sdrf.tsv"
Factor columns: ['factor value[phenotype]']
Characteristics columns (those covered by factor columns removed): ['characteristics[organism]', 'characteristics[organism part]', 'characteristics[cell type]', 'characteristics[disease]', 'characteristics[biological replicate]']
Error: 'NoneType' object has no attribute 'group'

Any additional suggestions?

My updated SDRF file is attached.

Thanks! Dave

my_SDRF.txt

davidecarlson commented 1 year ago

Okay, I was able to track down the error.

It seems that, contrary to the docs, the value of comment[cleavage agent details] cannot be set to "not applicable".

After replacing "not applicable" with a placeholder value (in this case NT=Trypsin; AC=MS:1001251; CS=(?<=[KR])(?!P)), the error goes away.
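For readers unfamiliar with the CS field: it appears to hold a regular expression marking the cleavage sites. Assuming the intended pattern is the usual trypsin rule, (?<=[KR])(?!P) (cut after K or R unless the next residue is P), a minimal Python sketch of how such a pattern splits a sequence might look like this. The example sequence and the `digest` helper are made up for illustration and are not part of sdrf-pipelines.

```python
import re
from typing import List

# Assumed trypsin rule: cleave after K or R, but not when the next residue is P.
TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def digest(sequence: str) -> List[str]:
    """Split a sequence at every cleavage site (hypothetical helper)."""
    # Splitting on a zero-width pattern requires Python 3.7+.
    # A site at the very end of the sequence yields an empty fragment; drop it.
    return [fragment for fragment in TRYPSIN_SITE.split(sequence) if fragment]


print(digest("AAKGGRPCCKDD"))  # ['AAK', 'GGRPCCK', 'DD']
```

Note that the R followed by P is not cut, which is what the negative lookahead (?!P) expresses.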

In my case, I am not certain if a cleavage agent was used or not, but I will try to find out from the core facility that generated the data.

Best, Dave

jpfeuffer commented 1 year ago

Good find! I believe this should be a bug then. I just renamed the title.

davidecarlson commented 1 year ago

Okay, good to know that it's a bug. In that case, I believe the error is introduced here:

https://github.com/bigbio/sdrf-pipelines/blob/f19d38ca2b6d51cce3cab9c3f8921ad2219fee80/sdrf_pipelines/openms/openms.py#L321

The re.search assumes that the value in the comment[cleavage agent details] column includes an "NT". If the regex search returns an empty result, the group method fails and throws an error.
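The failure mode described here is easy to reproduce in isolation. The sketch below does not copy the converter's code: the pattern `NT=(.+?)(;|$)` and the helper name `extract_enzyme_name` are assumptions, chosen only to show how calling `.group()` on a failed `re.search` produces this error, and how a guard could turn it into a more readable message.

```python
import re

# Assumed pattern, for illustration only: capture the text after "NT="
# up to the next ";" or the end of the cell.
NT_PATTERN = re.compile(r"NT=(.+?)(;|$)")


def extract_enzyme_name(cell):
    """Return the NT value of a cleavage-agent cell (hypothetical helper)."""
    match = NT_PATTERN.search(cell)
    if match is None:
        # re.search returns None when nothing matches, so check before .group()
        raise ValueError(
            "comment[cleavage agent details] has no 'NT=' entry: %r" % cell
        )
    return match.group(1).strip()


# Unguarded call: the search fails, and .group() is called on None.
try:
    re.search(r"NT=(.+?)(;|$)", "not applicable").group(1)
except AttributeError as err:
    print("Error:", err)

# Guarded call: the same input now produces a message that names the column.
try:
    extract_enzyme_name("not applicable")
except ValueError as err:
    print("Error:", err)
```

With a well-formed cell such as "NT=Trypsin; AC=MS:1001251", the helper returns "Trypsin".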

jpfeuffer commented 1 year ago

The docs are also misleading and should be reworked. They say the "NT" part is mandatory, so I am confused myself about how one would specify "no cleavage": would it be "NT=not applicable" or just "not applicable"? In any case, even if the converter could parse "NT=not applicable", it would probably still fail, since "not applicable" is not part of the mapping in https://github.com/bigbio/sdrf-pipelines/blob/f19d38ca2b6d51cce3cab9c3f8921ad2219fee80/sdrf_pipelines/openms/openms.py#L77
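To make the two-stage concern concrete (parsing first, then the mapping lookup), here is a small sketch. Everything in it is hypothetical: `HYPOTHETICAL_ENZYME_MAP` is a stand-in with invented entries, not the real table in openms.py, and treating a bare value like "NT=<value>" is just one possible design choice.

```python
# Invented stand-in for the converter's enzyme mapping; entries are illustrative.
HYPOTHETICAL_ENZYME_MAP = {
    "trypsin": "Trypsin",
    "no cleavage": "no cleavage",
}


def parse_key_values(cell):
    """Parse 'NT=...;AC=...' into a dict; parts without '=' are ignored."""
    fields = {}
    for part in cell.split(";"):
        key, sep, value = part.partition("=")  # split on the first '=' only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def resolve_enzyme(cell):
    """Map a cleavage-agent cell to an enzyme name, or raise a clear error."""
    fields = parse_key_values(cell)
    # Assumption: a bare value ("not applicable") is read as "NT=not applicable".
    name = fields.get("NT", cell.strip())
    try:
        return HYPOTHETICAL_ENZYME_MAP[name.lower()]
    except KeyError:
        raise ValueError("Unsupported cleavage agent: %r" % name) from None


print(resolve_enzyme("NT=Trypsin; AC=MS:1001251; CS=(?<=[KR])(?!P)"))
for cell in ("not applicable", "NT=not applicable"):
    try:
        resolve_enzyme(cell)
    except ValueError as err:
        print(err)
```

Under these assumptions both spellings fail at the lookup stage with the same message, which is the second problem raised above: fixing the parsing alone would not be enough.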

ping @ypriverol

ypriverol commented 1 year ago

"Not applicable" is valid in SDRF, but not for processing with quantms, because all pipelines currently supported in quantms are enzyme-specific.

jpfeuffer commented 1 year ago

For you @davidecarlson this also means you could emulate a no cleavage behaviour by specifying "NT=No cleavage".

@ypriverol They are only tested with enzymes for now, but I don't see any reason why running without an enzyme shouldn't work. All search engines support this.

davidecarlson commented 1 year ago

Thanks, guys!

From my perspective, this can be closed. But I will leave it open in case there is more to discuss.

I appreciate the assistance.

Best, Dave

ypriverol commented 1 year ago

@jpfeuffer I have already tested the pipeline without an enzyme and it doesn't work. I can try to find the issue. Neither case works: no enzyme or multiple enzymes.

ypriverol commented 1 year ago

@davidecarlson keep us posted with your results from quantms, we want to see how the pipeline works for others. Thanks for using the workflow.

ypriverol commented 1 year ago

I will close the issue.

bigbio / sdrf-pipelines

OpenMS converter does not handle "cleavage agent" = "not applicable" correctly #147