comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License

iso_tpm error when formatting #199

Closed igonzalsbmc closed 1 month ago

igonzalsbmc commented 1 month ago

I get this error when running:

# Rscript scripts/format_Ensembl_ids.R /mnt/experiments/Bebee-2015-Esrps/suppa/normal/output/salmon_quant/iso_tpm.txt
Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
Calls: rownames<- ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
non-unique values when setting 'row.names':  
Execution halted

The iso_tpm.txt file is:

SRR1725976  SRR1725977
chr1    1787.448015 2555.173720
chr2    1733.993854 2451.723972
chr3    4008.326785 4111.026866
chr4    2942.712200 4013.932420
chr5    2032.149162 2954.419842
chr6    1705.487116 2391.639462
chr7    3674.425678 5047.187459
chr8    1754.060475 2420.988596
chr9    2910.015983 4121.050887
chr10   2398.612721 3157.076040
chr11   9709.153076 12427.557935
chr12   1419.212682 1810.102926
chr13   1425.299671 1936.725587
chr14   1694.807446 2084.559411
chr15   9623.441164 13405.487944
chr16   5734.707472 5132.540952
chr17   2900.047059 4398.731571
chr18   1168.943309 1537.483448
chr19   4488.639602 4828.549501
chrX    911.284786  1230.470648
chrY    18.291910   16.900367
chrM    692165.047768   576805.530158
GL456210.1  0.000000    9.346041
GL456211.1  0.000000    13.121497
GL456212.1  0.000000    20.653315
GL456219.1  0.000000    0.000000
GL456221.1  0.000000    0.000000
GL456233.2  237.359786  524.644412
GL456239.1  0.000000    0.000000
GL456354.1  0.000000    0.000000
GL456359.1  0.000000    0.000000
GL456360.1  0.000000    0.000000
GL456366.1  0.000000    0.000000
GL456367.1  0.000000    0.000000
GL456368.1  178.347317  157.719897
GL456370.1  1950.177329 654.127903
GL456372.1  0.000000    0.000000
GL456378.1  227.678153  352.323035
GL456379.1  0.000000    0.000000
GL456381.1  0.000000    0.000000
GL456382.1  466.587357  1169.060542
GL456383.1  0.000000    0.000000
GL456385.1  0.000000    0.000000
GL456387.1  0.000000    0.000000
GL456389.1  0.000000    0.000000
GL456390.1  0.000000    0.000000
GL456392.1  76.207830   67.391343
GL456394.1  0.000000    0.000000
GL456396.1  0.000000    75.009179
JH584295.1  218961.841053   258513.224351
JH584296.1  0.000000    0.000000
JH584297.1  0.000000    0.000000
JH584298.1  0.000000    0.000000
JH584299.1  0.000000    1.663615
JH584300.1  0.000000    0.000000
JH584301.1  0.000000    0.000000
JH584302.1  0.000000    0.000000
JH584303.1  0.000000    0.000000
JH584304.1  13782.784902    51087.928550
MU069434.1  7335.036039 28055.033348
MU069435.1  577.872300  459.893261

EduEyras commented 1 month ago

Hi,

Those are chromosome IDs rather than transcript IDs. You may want to double-check how you created the transcript expression file.

cheers

Eduardo
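
A quick sanity check along these lines can be run on the row IDs before formatting. This is only a sketch: the regular expression assumes Ensembl-style transcript IDs (ENS, an optional species prefix such as MUS, then T and digits), and the variable names are illustrative rather than anything from SUPPA.

```r
# A few row IDs copied from the two files posted in this thread
row_ids <- c("chr1", "chrM", "GL456210.1", "ENSMUST00000119715.2")

# Assumption: transcript IDs follow the Ensembl pattern ENS[species]T<digits>
looks_like_transcript <- grepl("^ENS[A-Z]*T[0-9]+", row_ids)

# Report any IDs that do not look like transcripts
if (!all(looks_like_transcript)) {
  message("Not transcript IDs: ",
          paste(row_ids[!looks_like_transcript], collapse = ", "))
}
```

If every row fails the check, as with the chromosome and scaffold names above, the quantification was probably run against a genome rather than a transcriptome reference.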

igonzalsbmc commented 1 month ago

Now I generate this, for example:

SRR1725976  SRR1725977  SRR1725983  SRR1725984
ENSMUST00000119715.2|ENSMUSG00000081622.2|OTTMUSG00000017861.1|OTTMUST00000043230.1|Gm5937-201|Gm5937|643|processed_pseudogene| 0.236249    0.111980    0.288448    0.114749
ENSMUST00000088217.12|ENSMUSG00000025246.14|OTTMUSG00000017892.1|OTTMUST00000043270.1|Tbl1x-201|Tbl1x|5234|protein_coding|  12.498475   18.663619   17.518295   20.179102
ENSMUST00000133893.2|ENSMUSG00000025246.14|OTTMUSG00000017892.1|OTTMUST00000043271.1|Tbl1x-202|Tbl1x|3611|retained_intron|  0.099819    0.355980    0.273538    0.541361
ENSMUST00020183165.1|ENSMUSG00002076902.1|-|-|Gm55070-201|Gm55070|173|snRNA|    1.665572    0.000000    0.000000    0.000000
ENSMUST00000145498.2|ENSMUSG00000087152.4|OTTMUSG00000017882.1|OTTMUST00000043259.1|Cldn34-ps-201|Cldn34-ps|600|processed_pseudogene|   0.000000    0.000000    0.000000    0.000000
ENSMUST00000120309.2|ENSMUSG00000081358.2|OTTMUSG00000017872.1|OTTMUST00000043249.1|Gm14746-201|Gm14746|741|processed_pseudogene|   0.000000    0.000000    0.000000    0.000000
ENSMUST00000119819.2|ENSMUSG00000082842.2|OTTMUSG00000017878.1|OTTMUST00000043255.1|Gm14748-201|Gm14748|411|unprocessed_pseudogene| 0.000000    0.000000    0.000000    0.000000
ENSMUST00000121466.2|ENSMUSG00000081164.2|OTTMUSG00000017868.1|OTTMUST00000043238.1|Gm8722-201|Gm8722|769|processed_pseudogene| 0.044666    0.000000    0.000000    0.000000
ENSMUST00000036333.14|ENSMUSG00000035725.14|OTTMUSG00000017860.1|OTTMUST00000043228.1|Prkx-201|Prkx|4146|protein_coding|    5.665320    11.007346   11.237716   12.147375
ENSMUST00000114044.2|ENSMUSG00000035725.14|OTTMUSG00000017860.1|OTTMUST00000043229.1|Prkx-202|Prkx|2639|protein_coding| 0.011095    0.057386    0.072595    0.141069
ENSMUST00000148284.2|ENSMUSG00000085973.2|OTTMUSG00000017859.1|OTTMUST00000043227.1|Gm14742-201|Gm14742|381|lncRNA| 0.000000    0.000000    0.000000    0.093371
ENSMUST00000000003.14|ENSMUSG00000000003.16|OTTMUSG00000017891.1|OTTMUST00000043268.1|Pbsn-201|Pbsn|902|protein_coding| 0.000000    0.000000    0.000000    0.000000
ENSMUST00000114041.3|ENSMUSG00000000003.16|OTTMUSG00000017891.1|OTTMUST00000043269.1|Pbsn-202|Pbsn|697|protein_coding|  0.000000    0.000000    0.000000    0.000000
ENSMUST00000114039.3|ENSMUSG00000079522.3|OTTMUSG00000017866.1|OTTMUST00000043236.1|Gm14744-201|Gm14744|800|protein_coding| 0.000000    0.000000    0.041894    0.000000

But when I try to produce the iso_tpm_formatted file, I get this error (the excerpt above includes the two entries that share the gene ID ENSMUSG00000000003.16):

root@7d43c7ce68f4:/opt/SUPPA# Rscript scripts/format_Ensembl_ids.R /mnt/experiments/Bebee-2015-Esrps/suppa/iso_tpm.txt
Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
Calls: rownames<- ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
non-unique values when setting 'row.names': ‘ENSMUSG00000000003.16’, ‘ENSMUSG00000000028.16’, ‘ENSMUSG00000000031.19’, ‘ENSMUSG00000000037.18’, ‘ENSMUSG00000000049.12’, ‘ENSMUSG00000000056.8’, ‘ENSMUSG00000000058.7’, ‘ENSMUSG00000000078.8’, ‘ENSMUSG00000000085.17’, ‘ENSMUSG00000000088.8’, ‘ENSMUSG00000000094.13’, ‘ENSMUSG00000000103.13’, ‘ENSMUSG00000000126.12’, ‘ENSMUSG00000000127.16’, ‘ENSMUSG00000000131.16’, ‘ENSMUSG00000000134.18’, ‘ENSMUSG00000000142.16’, ‘ENSMUSG00000000148.18’, ‘ENSMUSG00000000149.11’, ‘ENSMUSG00000000154.17’, ‘ENSMUSG00000000157.17’, ‘ENSMUSG00000000159.17’, ‘ENSMUSG00000000167.15’, ‘ENSMUSG00000000168.11’, ‘ENSMUSG00000000184.13’, ‘ENSMUSG00000000194.14’, ‘ENSMUSG00000000197.9’, ‘ENSMUSG00000000202.10’, ‘ENSMUSG00000000204.17’, ‘ENSMUSG00000000214.12’, ‘ENSMUSG00000000215.12’, ‘ENSMUSG00000000223.14’, ‘ENSMUSG00000000244.18’, ‘E [... truncated] 
Execution halted

igonzalsbmc commented 1 month ago

Apparently the error is actually in format_Ensembl_ids.R. Shouldn't it take the first field when splitting by |?
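
To illustrate the point, here is a minimal sketch using the two Pbsn rows from the excerpt above. It assumes, based on the warning message, that the gene ID (second field) is what ends up as the row name. It does not reproduce the actual code of format_Ensembl_ids.R.

```r
# Two row names copied from the iso_tpm.txt excerpt earlier in this thread
ids <- c(
  "ENSMUST00000000003.14|ENSMUSG00000000003.16|OTTMUSG00000017891.1|OTTMUST00000043268.1|Pbsn-201|Pbsn|902|protein_coding|",
  "ENSMUST00000114041.3|ENSMUSG00000000003.16|OTTMUSG00000017891.1|OTTMUST00000043269.1|Pbsn-202|Pbsn|697|protein_coding|"
)

# Split each ID on the literal "|" character
fields <- strsplit(ids, "|", fixed = TRUE)

# First field: transcript ID; second field: gene ID
transcript_ids <- vapply(fields, function(f) f[1], character(1))
gene_ids <- vapply(fields, function(f) f[2], character(1))

# anyDuplicated() returns 0 when all values are unique,
# otherwise the index of the first duplicated value
anyDuplicated(transcript_ids)
anyDuplicated(gene_ids)
```

The transcript IDs are unique, while the gene ID repeats for every transcript of the same gene, which is what `rownames<-` rejects.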

EduEyras commented 3 weeks ago

I reckon the script is expecting a different format or number of IDs.

igonzals commented 1 week ago

For me the code that worked was:

#!/usr/bin/Rscript
# Run from bash: format the Ensembl transcript IDs of a transcript expression file for SUPPA

# Parse command line arguments
CHARACTER_command_args <- commandArgs(trailingOnly=TRUE)
# CHARACTER_command_args[1] <- "/projects_rg/SCLC_cohorts/George/Salmon/v2/iso_tpm.txt"

if (length(CHARACTER_command_args) == 1) {
  # The header row has one field fewer than the data rows, so read.table
  # uses the first column (the full "|"-separated ID) as the row names
  tpm <- read.table(file=CHARACTER_command_args[1])
  # Keep only the first "|"-separated field, i.e. the transcript ID
  ids <- unlist(lapply(rownames(tpm), function(x) strsplit(x, "\\|")[[1]][1]))
  rownames(tpm) <- ids
  # Drop the last four characters of the input path (".txt") and append "_formatted.txt"
  write.table(tpm, file=paste0(substr(CHARACTER_command_args[1], 1, nchar(CHARACTER_command_args[1])-4), "_formatted.txt"),
              quote=FALSE, row.names=TRUE, col.names=TRUE, sep="\t")
}
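
For anyone who wants to try the idea without touching real data, the following is a small self-contained sketch of the same steps on a temporary file. It uses a vectorised `sub()` in place of `lapply()`/`strsplit()`. The two rows are the Prkx entries from the excerpt above; the temporary file and the variable names are illustrative only.

```r
# Build a tiny tab-separated file shaped like the iso_tpm.txt excerpt in this thread
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "SRR1725976\tSRR1725977\tSRR1725983\tSRR1725984",
  "ENSMUST00000036333.14|ENSMUSG00000035725.14|OTTMUSG00000017860.1|OTTMUST00000043228.1|Prkx-201|Prkx|4146|protein_coding|\t5.665320\t11.007346\t11.237716\t12.147375",
  "ENSMUST00000114044.2|ENSMUSG00000035725.14|OTTMUSG00000017860.1|OTTMUST00000043229.1|Prkx-202|Prkx|2639|protein_coding|\t0.011095\t0.057386\t0.072595\t0.141069"
), tmp)

# The header has one field fewer than the data rows, so the first column becomes the row names
expr <- read.table(tmp)

# Remove everything from the first "|" onwards, leaving the transcript ID
rownames(expr) <- sub("\\|.*$", "", rownames(expr))

# Write next to the input, replacing the ".txt" suffix with "_formatted.txt"
out <- sub("\\.txt$", "_formatted.txt", tmp)
write.table(expr, file = out, quote = FALSE, sep = "\t",
            row.names = TRUE, col.names = TRUE)
```

Both rows share the gene ID ENSMUSG00000035725.14, so this input would hit the duplicate row-name error if the gene field were used instead.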