Error: Assigned data `[[x]] -[[y]]` must be compatible with existing data #190

jfertaj opened 3 years ago

jfertaj commented 3 years ago


I am trying to run a quantification analysis using artMS and get the following error: Error: Assigned data `[[x]] -[[y]]` must be compatible with existing data

The chunk of my code that triggers the error is this:

artmsAnalysisQuantifications(log2fc_file = "results.txt",
                              modelqc_file = "results_ModelQC.txt",
                              species = "human",
                              enrich = TRUE,
                              output_dir = "AnalysisQuantifications_followUP")

my SessionInfo is the following

My keys.txt is attached

Thanks a lot Juan


biodavidjm commented 3 years ago

Hi @jfertaj, this is an easy fix. The issue is that you are not following the guidelines for the Condition and BioReplicate notation, i.e., this part:

Condition: The conditions names must follow these rules: Use only letters (A - Z, both uppercase and lowercase) and numbers (0 - 9). The only special character allowed is underscore (_). Very important: A condition name cannot begin with a number (R limitation).

BioReplicate: biological replicate number. It is based on the condition name. Use as prefix the corresponding Condition name, and add as suffix dash (-) plus the biological replicate number. For example, if condition H1N1_06H has two biological replicates, name them H1N1_06H-1 and H1N1_06H-2.

which means that, for example, for your condition NV1, instead of these bioreplicate names...


you should have these ones instead


The same applies to all the other conditions. Once the new keys file is ready, you will have to run the Quantification step again and use the new results files generated with the new notation.
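
The two naming rules quoted above can be checked mechanically before running artMS. The following is a minimal base R sketch, assuming the keys file has been loaded as a data frame with `Condition` and `BioReplicate` columns. `check_keys` is a hypothetical helper, not an artMS function, and the toy rows are made up for illustration.

```r
# Hypothetical helper (not part of artMS): flags rows that break the quoted rules.
check_keys <- function(keys) {
  # Condition: only letters, digits and underscore; must not begin with a digit
  bad_condition <- !grepl("^[A-Za-z_][A-Za-z0-9_]*$", keys$Condition)

  # BioReplicate: Condition name, then "-", then the replicate number
  prefix <- paste0(keys$Condition, "-")
  has_prefix <- startsWith(keys$BioReplicate, prefix)
  suffix <- substring(keys$BioReplicate, nchar(prefix) + 1)
  bad_biorep <- !(has_prefix & grepl("^[0-9]+$", suffix))

  data.frame(keys, bad_condition, bad_biorep)
}

# Toy keys table (made-up values, for illustration only)
keys <- data.frame(
  Condition    = c("H1N1_06H", "H1N1_06H", "NV1", "10N"),
  BioReplicate = c("H1N1_06H-1", "H1N1_06H-2", "NV1_1", "10N-V1"),
  stringsAsFactors = FALSE
)

res <- check_keys(keys)
print(res)
```

In the toy table, the third row is flagged because the suffix is joined with an underscore instead of a dash, and the fourth because the condition begins with a digit and the replicate suffix is not a number.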

Next version of artMS will warn the user about this.

Hope it helps David

jfertaj commented 3 years ago

Hi David,

I have corrected the keys.txt file, and then I realised that my version of artMS was outdated, so I have updated to version v1.10.2. However, when running artmsQuantification I get a problem that I have posted about before:

  Join results in 10010711 rows; more than 764921 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Any idea why?

biodavidjm commented 3 years ago


I guess you are referring to this issue. I thought that the issue was resolved. The problem was not only the wrong notation for conditions and bioreplicates: the bioreplicates must also be unique, and so must the values in the Run column. So please make sure that this is the case. See the attached "corrected" version of your keys.txt file (please give it a try): keys-fixed.txt
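
A quick way to confirm the uniqueness requirement is to list every row whose value is repeated. This is a minimal base R sketch, assuming a keys data frame with `Run` and `BioReplicate` columns. `find_duplicates` is a hypothetical helper and the toy values are invented.

```r
# Hypothetical helper: returns all rows whose value in `column` occurs more than once.
find_duplicates <- function(keys, column) {
  values <- keys[[column]]
  # duplicated() marks only later repeats, so scan from both ends to catch the first copy too
  repeated <- duplicated(values) | duplicated(values, fromLast = TRUE)
  keys[repeated, , drop = FALSE]
}

# Toy keys table (invented values): Run 3 appears twice
keys <- data.frame(
  Run          = c(1, 2, 3, 3),
  BioReplicate = c("A-1", "A-2", "B-1", "B-2"),
  stringsAsFactors = FALSE
)

dup_run    <- find_duplicates(keys, "Run")
dup_biorep <- find_duplicates(keys, "BioReplicate")
print(dup_run)
print(nrow(dup_biorep))
```

An empty result for both columns means the keys file satisfies the uniqueness requirement described here.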

jfertaj commented 3 years ago

Hi David,

I have tried with your version, but the error is still there. Here is the full output:

artMS: BASIC QUALITY CONTROL (evidence.txt based)
--(-) Raw.files in evidence not found in keys file:

-- Plot: correlation matrices
---- by Biological replicates 
---- by Conditions 
-- Plot: intensity stats
<< Basic quality control analysis completed!
artMS: EXTENDED QUALITY CONTROL (-evidence.txt based)
--(-) Raw.files in evidence not found in keys file:

--- Plot PSM done 
--- Plot IONS done 
--- Plot TYPE done 
--- Plot PEPTIDES done 
--- Plot PEPTIDE OVERLAP done 
--- Plot PROTEINS done 
--- Plot PROTEIN OVERLAP done 
--- Plot Plot Ion Oversampling done 
--- Plot Charge State done 
--- Plot Mass Error done 
--- Plot Mass-over-Charge distribution done 
--- Plot Peptide Intensity CV done 
--- Plot Peptide Detection (using modified.sequence) done 
--- Plot Protein Intensity CV done 
--- Plot Protein Detection done 
--- Plot ID overlap done 
--- Plot PCA and Inter-Correlation (WARNING: it might take a long time. Please, be patient)
    (-) Skip peptide-based correlation matrix (too many samples)
    (-) Skip Protein-based correlation matrix (too many samples)
--- Plot Sample Preparation... done
>> QC extended completed
artMS: Relative Quantification using MSstats
>> Reading the configuration file
--(-) Raw.files in evidence not found in keys file:

>> CONVERT Intensity values < 1 to NA
-- Contaminants CON__|REV__ removed
-- Removing protein groups
-- Use <Leading.razor.protein> as Protein ID
-- Selecting Sequence Type: MaxQuant 'Modified.sequence' column
    (+) <Fraction> column added (with value 1, MSstats requirement)
-- Adding NA values for missing values (required by MSstats) 
-- Write out the MSstats input file (-mss.txt) 
>> RUNNING MSstats (it usually takes a 'long' time: please, be patient)
-- Normalization method: equalizeMedians
INFO  [2021-09-13 22:11:26] ** Features with one or two measurements across runs are removed.
INFO  [2021-09-13 22:11:27] ** Fractionation handled.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 8506154 rows; more than 757484 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
In addition: Warning messages:
1: ggrepel: 174 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
2: In RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

3: In RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

4: ggrepel: 32 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
5: ggrepel: 48 unlabeled data points (too many overlaps). Consider increasing max.overlaps 

Sorry for bothering so much

jfertaj commented 3 years ago

Also, I have created a new keys.txt from scratch, loaded it in R and tested for unique values like this:

> unique(sort(keys$Run))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
> unique(sort(keys$BioReplicate))
 [1] "10N-V1" "10N-V4" "10P-V1" "10P-V4" "10S-V1" "10S-V4" "11N-V1" "11N-V4"
 [9] "11P-V1" "11P-V4" "11S-V1" "11S-V4" "12N-V1" "12N-V4" "12P-V1" "12P-V4"
[17] "12S-V1" "12S-V4" "13N-V1" "13N-V4" "13P-V1" "13P-V4" "13S-V1" "13S-V4"
[25] "14N-V1" "14N-V4" "14P-V1" "14P-V4" "14S-V1" "14S-V4" "15N-V1" "15N-V4"
[33] "15P-V1" "15P-V4" "15S-V1" "15S-V4" "1N-V1"  "1N-V4"  "1P-V1"  "1P-V4" 
[41] "1S-V1"  "1S-V4"  "2N-V1"  "2N-V4"  "2P-V1"  "2P-V4"  "2S-V1"  "2S-V4" 
[49] "3N-V1"  "3N-V4"  "3P-V1"  "3P-V4"  "3S-V1"  "3S-V4"  "4N-V1"  "4N-V4" 
[57] "4P-V1"  "4P-V4"  "4S-V1"  "4S-V4"  "5N-V1"  "5N-V4"  "5P-V1"  "5P-V4" 
[65] "5S-V1"  "5S-V4"  "6N-V1"  "6N-V4"  "6P-V1"  "6P-V4"  "6S-V1"  "6S-V4" 
[73] "7N-V1"  "7N-V4"  "7P-V1"  "7P-V4"  "7S-V1"  "7S-V4"  "8N-V1"  "8N-V4" 
[81] "8P-V1"  "8P-V4"  "8S-V1"  "8S-V4"  "9N-V1"  "9N-V4"  "9P-V1"  "9P-V4" 
[89] "9S-V1"  "9S-V4" 

The number of unique elements is equal to 90, the original number of samples. I need to remove one sample after QC, but to start from scratch I have re-run it with the whole dataset.

biodavidjm commented 3 years ago

Hi there,

Remember the important rules:

Condition: The conditions names must follow these rules:

BioReplicate: biological replicate number. It is based on the condition name. Use as prefix the corresponding Condition name, and add as suffix dash (-) plus the biological replicate number. For example, if condition H1N1_06H has two biological replicates, name them H1N1_06H-1 and H1N1_06H-2.

Have you tried the keys file that I included in my previous response?

jfertaj commented 3 years ago

Hi David,

Yes, I have tried with the keys file you shared but still have the same problem. I am going to share with you the folder where I have all the files so you can see whether you are able to reproduce the error.

Thanks for your help. Regards


Here is the link:

biodavidjm commented 3 years ago

Thanks! I'll take a look and get back to you soon

jfertaj commented 3 years ago

Hi David,

Sorry for bothering you again and on Sunday. Did you have time to see if you were able to replicate my problem?

Thanks a lot Juan

biodavidjm commented 3 years ago

Hi Juan,

sorry for the late response. I've been carefully debugging the issue and I am sorry to report that this is not an artMS issue, but rather an MSstats/data.table one. According to the error message:

INFO  [2021-09-22 08:15:39] ** Features with one or two measurements across runs are removed.
INFO  [2021-09-22 08:15:39] ** Fractionation handled.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 8512292 rows; more than 765628 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

and based on similar errors found on the internet, the error could be solved if the merge function that is called somewhere included the option allow.cartesian=TRUE (data.table does not enable it by default).
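
For illustration, the same data.table error can be reproduced on a toy join. This is a minimal sketch with made-up tables, not the MSstats code path; where the option would need to be set inside MSstats is not established in this thread. It assumes the data.table package is installed.

```r
library(data.table)

# Toy tables (made up): the key value "a" is repeated on both sides
x <- data.table(k = rep("a", 4), vx = 1:4)
i <- data.table(k = rep("a", 4), vi = 5:8)

# Default join: 4 x 4 = 16 result rows exceed nrow(x) + nrow(i) = 8, so data.table stops
default_msg <- tryCatch({
  x[i, on = "k"]
  "no error"
}, error = function(e) conditionMessage(e))
print(default_msg)

# Diagnostic: which key values occur more than once in i?
dup_keys <- i[, .N, by = k][N > 1]
print(dup_keys)

# Explicitly allowing the large result lets the join run
joined <- x[i, on = "k", allow.cartesian = TRUE]
print(nrow(joined))
```

Note that allow.cartesian=TRUE only permits the large result. If the repeated keys are unintended, the duplicates themselves are the thing to fix.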

I strongly encourage you to report this error in the MSstats Google group. Specifically, it fails when running this MSstats function (normalization step):

mssquant = dataProcess(
  raw = dmss,
  logTrans = 2,
  normalization = "equalizeMedians",
  nameStandards = NULL,
  featureSubset = "all",
  remove_uninformative_feature_outlier = FALSE,
  min_feature_count = 2,
  n_top_feature = 3,
  summaryMethod = "TMP",
  equalFeatureVar = TRUE,
  censoredInt = "NA",
  MBimpute = 1,
  remove50missing = FALSE,
  fix_missing = NULL,
  maxQuantileforCensored = 0.999,
  use_log_file = FALSE,
  append = FALSE,
  verbose = TRUE,
  log_file_path = NULL
)

dmss is the evidence-mss.txt file generated by artMS; you could include it if they ask you for it.

However, let me point something out. The number of proteins identified is remarkably low:

In the evidence file:

> evidence %>% summarise_all(n_distinct)
  Sequence Length Modifications Modified.sequence Oxidation..M..Probabilities Oxidation..M..Score.Diffs Acetyl..Protein.N.term.
1     3658     37             6              3970                        1238                      5903                       2
  Oxidation..M. Missed.cleavages Proteins Leading.proteins Leading.razor.protein Gene.names Protein.names Type Raw.file Experiment
1             4                3      455              383                   340        383           381

After contaminants and protein group removal:

> dmss %>% summarise_all(n_distinct)
  ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate Run Fraction
1         308            3519               4           1             1                1         6           90  90        1
1    144852

Barely 308 proteins. Is this expected? Did you search with the right database? You should mention this when asking in the MSstats group.

Please, let us know how it goes.


J-Sha commented 2 years ago

Hi Juan and David,

I just ran into the same issue as Juan, and I also have a relatively small dataset (~300 proteins). Did you find a solution for it?

Actually, I realized that this error occurs right after the message "--- Number of +/- INF values: 344", so I think it happens during the imputeMissingValue step, where the original log2FC values are merged with the imputed ones. Here is the full error: "Error: Assigned data[[x]] -[[y]] must be compatible with existing data. x Existing data has 156 rows. x Assigned data has 0 rows. ℹ Only vectors of size 1 are recycled."
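
The wording of this message matches what the tibble package reports when a column assignment supplies a vector of the wrong length. The following is a minimal sketch of one way a zero-row assignment can arise, namely subtracting two columns that do not exist in the table. It is not artMS's code, the column names are hypothetical, and whether this is the actual cause here is an assumption. It assumes the tibble package is installed.

```r
library(tibble)

# Toy table; column names are hypothetical
tb <- tibble(cond_A = c(1.5, 2.0, 3.25))

x <- "cond_B"  # not present in tb
y <- "cond_C"  # not present in tb

# Missing columns come back as NULL, and NULL - NULL is a zero-length vector
diff_xy <- tb[[x]] - tb[[y]]
print(length(diff_xy))

# Assigning a zero-length vector to a 3-row tibble raises the tibble error
msg <- tryCatch({
  tb[["log2fc"]] <- diff_xy
  "no error"
}, error = function(e) conditionMessage(e))
cat(msg, "\n")
```

If this is what happens, checking that the column names built from the condition labels actually exist in the table (for example with `all(c(x, y) %in% names(tb))`) would be a reasonable first diagnostic.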

Would it be possible to separate the imputation and stats steps? Then maybe we could skip the imputation errors and feed the imputed data directly into MSstats to perform the stats.

Looking forward to your response. I really appreciate it!

Best, Jihui