MannLabs / directlfq

Fast and accurate label-free quantification for small and very large numbers of proteomes
https://www.mcponline.org/article/S1535-9476(23)00092-0/fulltext
Apache License 2.0

Missing row intensities and entries #17

Open tomthun opened 1 year ago

tomthun commented 1 year ago

Describe the bug: Some rows are 0 for all replicates in CustomDf.aq_reformat.tsv.ion_intensities.tsv, even though there are valid values in the original CustomDf.aq_reformat.tsv input dataframe. Furthermore, some rows are missing when comparing the two .tsv files, which I did not expect.

To Reproduce: See #16 for input and output files.

Expected behavior

  1. Intensities curated via directLFQ should not be 0 for all replicates, and
  2. "CustomDf.aq_reformat.tsv.ion_intensities.tsv" should have the same number of row entries as the input data "CustomDf.aq_reformat.tsv".
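Both expectations can be checked by comparing the two files directly. The following is a minimal sketch, not part of directLFQ: it assumes each file has a "protein" column, an "ion" column, and one column per replicate, and the helper names (`load_intensities`, `compare`) and the tiny tables are made up for illustration.

```python
import csv
import io
import math

# Assumed layout: a "protein" column, an "ion" column, then one column per replicate.
ID_COLUMNS = ("protein", "ion")


def parse_intensity(text):
    """Turn a cell into a float; empty or non-numeric cells become NaN."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return math.nan


def load_intensities(handle):
    """Map each ion name to its list of replicate intensities."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        table[row["ion"]] = [
            parse_intensity(value)
            for column, value in row.items()
            if column not in ID_COLUMNS
        ]
    return table


def has_valid_value(values):
    """A value counts as valid here if it is a number greater than 0."""
    return any(not math.isnan(v) and v > 0 for v in values)


def compare(input_table, output_table):
    """Return (ions zeroed in the output, ions missing from the output)."""
    zeroed = sorted(
        ion
        for ion, values in output_table.items()
        if not has_valid_value(values) and has_valid_value(input_table.get(ion, []))
    )
    missing = sorted(set(input_table) - set(output_table))
    return zeroed, missing


# Tiny made-up tables, for illustration only.
input_tsv = (
    "protein\tion\trepli1\trepli2\n"
    "P1\tionA\t10\t12\n"
    "P1\tionB\t5\t\n"
    "P1\tionC\t7\t8\n"
)
output_tsv = (
    "protein\tion\trepli1\trepli2\n"
    "P1\tionA\t11\t11\n"
    "P1\tionB\t0\t0\n"
)

zeroed, missing = compare(
    load_intensities(io.StringIO(input_tsv)),
    load_intensities(io.StringIO(output_tsv)),
)
print("zeroed in output:", zeroed)
print("missing from output:", missing)
```

For real files, the `io.StringIO` objects would be replaced by opened file handles, and the column names adjusted if the actual headers differ.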

Version (please complete the following information):

ammarcsj commented 1 year ago

Hi :), can you give me the names of precursors with this behaviour?

tomthun commented 1 year ago

Here are some precursors extracted from the curated CustomDf.aq_reformat.tsv.ion_intensities.tsv for case number 1:

image

And here is a snippet for case 2 (note that these precursors are NOT found in the final output but exist in the CustomDf.aq_reformat.tsv input):

image

For the latter case, I find 599 missing precursors overall.

ammarcsj commented 1 year ago

Hi, thanks for looking into this so deeply!

1) In general, I don't use precursors with only a single intensity value downstream. The reason is that directLFQ is a ratio-based method, and you cannot calculate ratios with only one value. So even if they are in the .ion_intensities.tsv, they don't end up in the protein quantification. But indeed, I saw some precursors with only one value being reported in the .ion_intensities.tsv while others are set to 0, as you show. So I should fix this, but I don't think it currently affects protein quantification.

2) Indeed, the Q15149 example is quite strange; these precursors should be in there. I will have to look into this and fix it in a future release. The non-Q15149 examples are fine again, as they have only single intensities. I noticed that Q15149 has very many precursors, so a few missing ones will have virtually no effect on the quantification. Can you check whether there are precursors that a) are missing, b) have more than one intensity value, and c) belong to a protein with fewer than 7 precursors?
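The three-part check requested here could be scripted roughly as follows. This is only a sketch under assumptions: the input is represented as (protein, ion, intensities) tuples, the output as a set of ion names, and the function names and example data are hypothetical rather than taken from directLFQ.

```python
import math
from collections import Counter


def count_valid(intensities):
    """Number of replicates with a usable (non-NaN, positive) intensity."""
    return sum(1 for v in intensities if not math.isnan(v) and v > 0)


def find_candidates(input_rows, output_ions, max_precursors=7):
    """Return (protein, ion) pairs that meet criteria a), b) and c).

    input_rows: iterable of (protein, ion, intensities) from the input table.
    output_ions: set of ion names present in the output table.
    """
    rows = list(input_rows)
    precursors_per_protein = Counter(protein for protein, _, _ in rows)
    return [
        (protein, ion)
        for protein, ion, intensities in rows
        if ion not in output_ions                              # a) missing
        and count_valid(intensities) > 1                       # b) more than one value
        and precursors_per_protein[protein] < max_precursors   # c) fewer than 7 precursors
    ]


# Made-up example: one small protein (3 precursors) and one large one (8 precursors).
nan = math.nan
small = [
    ("PROT_SMALL", "s1", [1.0, 2.0, nan]),
    ("PROT_SMALL", "s2", [nan, nan, 3.0]),
    ("PROT_SMALL", "s3", [nan, nan, 4.0]),
]
big = [("PROT_BIG", f"b{i}", [1.0, 2.0, 3.0]) for i in range(8)]

# Pretend that s1 and b0 were dropped from the output.
output_ions = {"s2", "s3"} | {f"b{i}" for i in range(1, 8)}

candidates = find_candidates(small + big, output_ions)
print(candidates)
```

In this made-up data only `s1` qualifies: `b0` is also missing and has several values, but its protein has 8 precursors.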

Best, Constantin

tomthun commented 1 year ago

There are 4 cases with this behaviour: P78527, Q09666, Q14204 and Q15149. However, none belongs to a protein with fewer than 7 precursors.

However, there is still the example O00268 with 3 precursors overall (note that 2 of those 3 have only one valid value, but for replicate 3), of which one with two valid values is missing:

repli1   repli2    repli3
1534.2   1999.24   nan

I can also send you the list of missing ions if you have not created it yourself by now. Thanks for looking into this! Best,

Tom

tomthun commented 1 year ago

Any updates?

ammarcsj commented 1 year ago

Hi Tom,

I will fix it in the next release, where I will also address a few other things. Unfortunately, I cannot tell you exactly when this will be out, as I'm currently busy with a few other projects. I will let you know when it is out.

The problem we are talking about is a filtering problem: in a few edge cases, a bit too much gets filtered out. This definitely needs to be fixed but is very unlikely to hamper any biological analysis. So I think you can go forward with any biological analysis you are doing.
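To illustrate the intended filtering rule mentioned earlier in the thread (a precursor needs more than one intensity value, because a ratio cannot be formed from a single value), here is a small sketch. It is not directLFQ's actual implementation; the function names are made up, and "valid" is assumed to mean a non-NaN, positive intensity.

```python
import math
from itertools import combinations


def valid_indices(intensities):
    """Indices of replicates with a usable (non-NaN, positive) intensity."""
    return [i for i, v in enumerate(intensities) if not math.isnan(v) and v > 0]


def pairwise_log2_ratios(intensities):
    """Log2 ratio for every pair of replicates in which both values are valid."""
    return {
        (i, j): math.log2(intensities[i] / intensities[j])
        for i, j in combinations(valid_indices(intensities), 2)
    }


def is_quantifiable(intensities):
    """Assumed rule: keep a precursor only if at least one ratio can be formed."""
    return len(pairwise_log2_ratios(intensities)) >= 1


# The O00268 precursor quoted above in the thread: two valid values, one NaN.
two_values = [1534.2, 1999.24, math.nan]
# A made-up precursor with a single valid value.
one_value = [math.nan, math.nan, 1200.0]

print(is_quantifiable(two_values))  # one ratio exists, for replicates (0, 1)
print(is_quantifiable(one_value))   # no pair of valid values, so no ratio
```

Under this rule the precursor with two valid values would be kept, which matches the expectation in the thread that it should not have been filtered out.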

Best, Constantin