error running make_se_from_files using diann pg matrix

jflucier commented 3 months ago

Hi,

When I pass my DIA-NN result file to the make_se_from_files function, it returns the following error:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘A0A075B5M4’, ‘A0A075B5M7’, ‘A0A075B5N3’, ‘A0A075B5N4’, ‘A0A075B5R7’, ‘A0A075B5T2’, ‘A0A075B5Y4’, ‘A0A075B666’, ‘A0A087WRA4’, ‘A0A087WS16’, ‘A0A0A6YYP6’, ‘A0A0B4J1I0’, ‘A0A0B4J1M0’, ‘A0A0J9YVH3’, ‘A0A0R4J2B2’, ‘A0A140LIF8’, ‘A0A571BF69’, ‘A2A4P0’, ‘A2A5R2’, ‘A2A8L1’, ‘A2AAY5’, ‘A2AB59’, ‘A2ADY9’, ‘A2APV2’, ‘A2AQ07’, ‘A2ASS6’, ‘A2BH40’, ‘A2CG49’, ‘A2CG63’, ‘A3KFU5’, ‘A8DUK4’, ‘B1ARD6’, ‘B2RSH2’, ‘B2RY04’, ‘B9EJ86’, ‘D3YWQ0-2’, ‘D3YXK2’, ‘D3Z3J6’, ‘D3Z6Q9’, ‘E9PUM5’, ‘E9PVA6’, ‘E9PZM4’, ‘E9Q166’, ‘E9Q1A5’, ‘E9Q1F2’, ‘E9Q1P8’, ‘E9Q448’, ‘E9Q512’, ‘E9QA15’, ‘F8VPU6’, ‘G5E829’, ‘G5E8K5’, ‘G5E8V9’, ‘O08528’, ‘O08638’, ‘O08664’, ‘O08797’, ‘O08807’, ‘O08900’, ‘O08911’, ‘O09106’, ‘O09110’, ‘O35226’, ‘O [... truncated]

I traced the error by executing the make_se_from_files function line by line. It happens in the make_unique function, which returns duplicates. If I inspect the returned proteins_unique object, the ID is truncated whenever a protein group is composed of multiple proteins. For example:

DIA-NN protein group: ID
A0A075B5M4: A0A075B5M4
A0A075B5M4;A0A0A6YYE7: A0A075B5M4
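
To check this diagnosis on a full matrix, a minimal sketch along the following lines may help. It assumes the protein group ID sits in the first tab-separated column and that IDs are cut at the first ";" as described above. The function name, the header name Protein.Group, and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every first-protein ID that would occur more than
# once if each protein group ID (column 1) were cut at the first ";".
list_truncated_duplicates() {
  awk -F'\t' '
    NR == 1 { next }                              # skip the header row
    { split($1, parts, ";"); seen[parts[1]]++ }   # count IDs after truncation
    END { for (id in seen) if (seen[id] > 1) print id }
  ' "$1" | sort
}

# Tiny made-up matrix that reuses the two group IDs from the example above.
sample=$(mktemp)
printf 'Protein.Group\tsample_1\n' > "$sample"
printf 'A0A075B5M4\t100\n' >> "$sample"
printf 'A0A075B5M4;A0A0A6YYE7\t200\n' >> "$sample"
printf 'A2A4P0\t300\n' >> "$sample"

list_truncated_duplicates "$sample"   # prints A0A075B5M4
rm -f "$sample"
```

Running the helper on report.pg_matrix.tsv should list the IDs that collide after truncation, which can be compared against the names in the warning message.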

Would it be OK to prefilter the DIA-NN results to remove all rows that contain a group of more than one protein, such as A0A075B5M4;A0A0A6YYE7, or would that bias the results?
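
How much such a prefilter removes can at least be measured before deciding. The sketch below does not answer the bias question. It only counts single-protein and multi-protein groups, assuming the group ID is in the first tab-separated column. The function name and sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count protein groups (column 1) with exactly one protein
# versus groups that list several proteins separated by ";".
count_group_types() {
  awk -F'\t' '
    NR == 1  { next }              # skip the header row
    $1 ~ /;/ { multi++; next }     # group lists more than one protein
             { single++ }          # group contains a single protein
    END { printf "single=%d multi=%d\n", single, multi }
  ' "$1"
}

# Tiny made-up matrix for illustration only.
sample=$(mktemp)
printf 'Protein.Group\tsample_1\n' > "$sample"
printf 'A0A075B5M4\t100\n' >> "$sample"
printf 'A0A075B5M4;A0A0A6YYE7\t200\n' >> "$sample"
printf 'A2A4P0\t300\n' >> "$sample"

count_group_types "$sample"   # prints single=2 multi=1
rm -f "$sample"
```

A large multi count relative to single would suggest that dropping those rows discards a substantial share of the quantified groups.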

Thank you in advance for your help, JF

hsiaoyi0504 commented 3 months ago

@jflucier Is your DIA-NN input file generated by running FragPipe?

hsiaoyi0504 commented 3 months ago

Also, which version are you using?

jflucier commented 3 months ago

Hi

Is your DIA-NN input file generated by running FragPipe?

No, I run DIA-NN from the command line on a Linux cluster. Here is the command I use:

diann --threads 40 --verbose 2 \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_1_Slot1-32_1_24034.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_2_Slot1-33_1_24036.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_3_Slot1-34_1_24038.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_4_Slot1-35_1_24040.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_1_Slot1-36_1_24046.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_2_Slot1-37_1_24048.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_3_Slot1-38_1_24050.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_4_Slot1-39_1_24052.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_1_Slot1-40_1_24055.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_2_Slot1-41_1_24057.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_3_Slot1-42_1_24059.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_4_Slot1-43_1_24061.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_1_Slot1-28_1_24025.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_2_Slot1-29_1_24027.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_3_Slot1-30_1_24043.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_4_Slot1-31_1_24031.d \
--temp $SLURM_TMPDIR/temp \
--cut K*,R* --missed-cleavages 2 --met-excision \
--fasta "$SLURM_TMPDIR/UP000000589_10090_combo.fasta" --fasta-search \
--out-lib "$SLURM_TMPDIR/out/report-lib.tsv" --out-lib-copy \
--out "$SLURM_TMPDIR/out/report.tsv" \
--mass-acc-ms1 20 --mass-acc 20 \
--min-pep-len 7 --max-pep-len 30 \
--min-pr-charge 1 --max-pr-charge 5 \
--min-pr-mz 100 --max-pr-mz 1700 \
--min-fr-mz 100 --max-fr-mz 1500 \
--predictor --reanalyse --matrices --smart-profiling --pg-level 1 \
--unimod4 --unimod35 --var-mod UniMod:1,42.010565,*n,ntermacetyl

Also, which version are you using?

I use DIA-NN v1.8.1, installed inside a Singularity container built from a Docker image.

Thank you again for your help

hsiaoyi0504 commented 3 months ago

I was asking about the version of FragPipeAnalystR. I believe FragPipe doesn't generate reports with this issue. We would like to support DIA-NN reports more fully, but we don't support them yet. If you are willing to share your file, you can email it to me at yihsiao@umich.edu

jflucier commented 3 months ago

The FragPipeAnalystR version installed is 0.1.7

I will send my analysis file directly to the email address provided.

Thanks again!

jflucier commented 2 months ago

Hello,

I managed to get this working by filtering the pg report to keep only proteotypic protein groups (those without a ; in the protein group name). Here is the command I used to filter:

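perl -ne '
# Keep only rows whose first column (the protein group ID) names one protein.
chomp($_);
my @t = split("\t",$_);
# Split the group ID on ";" to count the proteins in the group.
my @prot_ident = split(";",$t[0]);
if(scalar(@prot_ident) == 1){
  print $_ . "\n";
}
' report.pg_matrix.tsv > report.pg_matrix.proteoptypic.tsv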

Afterwards, the following command ran successfully:

ccrcc <- make_se_from_files(
  "report.pg_matrix.proteoptypic.tsv",
  "experiment_annotation.tsv",
  type = "DIA",
  level = "gene"
)
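
For readers without Perl at hand, roughly the same filter can be sketched in awk. This is an approximation of the Perl command above, not a drop-in replacement: it assumes the group ID is in the first tab-separated column and keeps the header row explicitly. The function name and sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy the header plus every row whose first column
# contains no ";" (a single-protein group) from file $1 to file $2.
keep_single_protein_groups() {
  awk -F'\t' 'NR == 1 || $1 !~ /;/' "$1" > "$2"
}

# Tiny made-up matrix for illustration only.
sample=$(mktemp)
filtered=$(mktemp)
printf 'Protein.Group\tsample_1\n' > "$sample"
printf 'A0A075B5M4\t100\n' >> "$sample"
printf 'A0A075B5M4;A0A0A6YYE7\t200\n' >> "$sample"
printf 'A2A4P0\t300\n' >> "$sample"

keep_single_protein_groups "$sample" "$filtered"
cat "$filtered"   # header plus the two single-protein rows
rm -f "$sample" "$filtered"
```

With the file names from this thread, the call would be keep_single_protein_groups report.pg_matrix.tsv report.pg_matrix.proteoptypic.tsv.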

Nesvilab / FragPipeAnalystR

error running make_se_from_files using diann pg matrix #14