Closed amcruise closed 3 years ago
Hi,
The matrix building script uses "ls -1v $1" to gather all of the rsem files. Can you confirm that ls -1v ../Path/to/regex/rsem/files/*.rsem writes out the proper filenames?
-Camden
Could you also confirm that the RSEM output you are linking to is properly tab delimited and that the TPM values are in the 6th column?
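For anyone who wants to check both points outside the script, here is a rough Python sketch. It is not part of SOMatic: `natural_key`, `find_rsem_files`, `tpm_column_ok` and `report` are made-up helper names, and the natural sort only approximates the ordering that `ls -1v` gives.

```python
import glob
import re


def natural_key(name):
    """Sort key that compares digit runs numerically, roughly like `ls -v`."""
    # re.split with a capturing group alternates text and digit runs,
    # so strings and ints always line up at the same positions.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def find_rsem_files(pattern):
    """Return the files matching a glob pattern in natural order."""
    return sorted(glob.glob(pattern), key=natural_key)


def tpm_column_ok(header_line):
    """True if the header is tab-delimited and its 6th field is 'TPM'."""
    fields = header_line.rstrip("\n").split("\t")
    return len(fields) >= 6 and fields[5] == "TPM"


def report(pattern):
    """Return (path, header_ok) pairs for every file matching the pattern."""
    results = []
    for path in find_rsem_files(pattern):
        with open(path) as handle:
            results.append((path, tpm_column_ok(handle.readline())))
    return results
```

Calling `report("path/to/rsem/*.rsem")` (the path is a placeholder) should list every matched file along with whether its header passes the check. An empty list would mean the pattern matched nothing.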
Thank you for your quick reply Camden, I will check the RSEM outputs and get back to you!
I checked: ls -1v path/to/rsem
And it correctly lists the file names.
This is the format of the RSEM files:
gene_id transcript_id(s) length effective_length expected_count TPM FPKM
ENSG00000000003.10 ENST00000373020.4_TSPAN6-001,ENST00000494424.1_TSPAN6-002,ENST00000496771.1_TSPAN6-003 2206.00 2136.39 345.00 16.21 11.58
The script is able to locate them and populate a sample list but it does not output a training matrix. I am unsure why. :(
I tried to create my own training matrix:
ENSG00000230043.1 1.81 0 12.695 39.135 28.835 12.035 6.14 0
ENSG00000252200.1 0 7.8 0 32.205 6.555 0 0 0
ENSG00000182916.7 33.925 8.86 14.81 37.96 0.14 0 0.09 0.045
ENSG00000105472.8 4.425 9.365 9.78 21.505 0.025 0 0.02 0
ENSG00000133636.6 1.475 0.6 0.105 88.55 0.33 0.35 0.285 0.325
ENSG00000216077.1 0 0 0 18.69 0 0 0 0
ENSG00000169908.6 28.84 17.975 20.79 51.575 0.115 0.155 0.105 0.19
ENSG00000265455.1 0 10.41 6.035 17.295 4.75 5.505 1.975 0
ENSG00000145040.3 1 4.045 0.915 14.88 0.13 1.15 1.59 0
ENSG00000165092.8 132.235 56.93 60.52 43.305 0.39 2.25 1.89 0.21
ENSG00000266241.1 0 2.815 0 13.185 0 0 0 0
ENSG00000091664.7 5.98 1.6 4.405 69.92 1.335 0.195 0.97 0.435
While it was able to train the SOM and build the site, it seems that it is not amenable to downstream metaclustering analysis.
I have been running into a strange problem recently in which the unit code fails for some unknown reason and the unit files become empty. Can you check the files in SOMName/data/som/units to see if they are all 0 size? "ls -l". If so, I'll work on a fix today.
They are actually not zero sized! So just to double check, the training matrix is composed of rows of genes and columns of TPMs for each sample?
Yes, it looks like the matrix you made is of the right format. It is tab delimited yes? What part of the metaclustering code is failing?
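A hand-built matrix can be sanity-checked before training with something like the Python sketch below. It assumes the layout described in this thread (one gene ID followed by one value per sample, separated by single tabs, no header row). `find_matrix_problems` is a hypothetical helper, not a SOMatic function.

```python
def find_matrix_problems(lines, n_samples):
    """Return (line_number, message) pairs for rows that look malformed.

    Each row is expected to hold a gene ID followed by `n_samples`
    numeric values, separated by single tab characters.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if "" in fields:
            # Two tabs in a row produce an empty field between them.
            problems.append((number, "empty field (double tab?)"))
            continue
        if len(fields) != n_samples + 1:
            problems.append(
                (number, f"expected {n_samples + 1} fields, found {len(fields)}"))
            continue
        try:
            for value in fields[1:]:
                float(value)
        except ValueError:
            problems.append((number, "non-numeric value"))
    return problems
```

Usage would be along the lines of `find_matrix_problems(open("TrainingMatrix"), n_samples=8)`, where the file name and sample count are placeholders. An empty list means no row tripped any of the three checks.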
Ok awesome! Yes it is tab delimited. I am trying to retrace my steps from earlier, lost track of what I did exactly. It is currently stopped at "pushing threads".
On second thought, this is where it crashed earlier, but it seems to be working now:
Cluster Num: 5
Pushing threads
70724.5
Cluster Num: 6
Pushing threads
Will keep you posted!
Hey Camden,
Are there any issues that you know could cause this corrupted SOM? Would appreciate any clarification here. Thanks!
This could have been caused by an all-zero vector in the training matrix, or by one sample that is driving all the differences. I'd highly recommend log correcting your RNA data: log2(x+1).
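As a rough illustration of both suggestions, the Python sketch below flags all-zero gene rows and applies log2(x+1) to a row of values. The function names and the dict-of-lists layout are assumptions made for this example, not how SOMatic stores the matrix internally.

```python
import math


def log2_plus_one(values):
    """Apply the log2(x + 1) correction to a list of expression values."""
    return [math.log2(value + 1.0) for value in values]


def all_zero_rows(matrix):
    """Return the gene IDs whose values are zero in every sample.

    `matrix` maps gene ID -> list of per-sample values.
    """
    return [gene for gene, values in matrix.items()
            if all(value == 0 for value in values)]


# Small made-up example; the gene names are placeholders.
example = {
    "geneA": [0, 1, 3],
    "geneB": [0, 0, 0],
}
zero_genes = all_zero_rows(example)           # ["geneB"]
corrected = log2_plus_one(example["geneA"])   # [0.0, 1.0, 2.0]
```

The +1 keeps zero values at zero after the transform, and the log compresses large values, so a single high-expression sample dominates the distances less.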
So I tried using the log2 option. It seems to work for all but one sample, which returns 0 on all of its units. We ran the samples in duplicate, so there is another SOM to show what we might expect. Is there any reason that this may be happening to this specific sample? It is the first sample in the sample list. I am not sure why it does not work. The training matrix also contains values for all replicates, so in theory there should at least be something there.
Ya, I'm not quite sure why that could be happening to the first data set. It would make sense if it were the last and you didn't have the right number of columns in your training matrix (i.e. one was missing). Could you check to make sure that a GFP_1.map file exists in the data/som directory? Does it have real data in it? It is also possible that the old website is stuck in your browser cache if you are using the same path as the previous site. You may need to go to WebsitePath/data/som/GFP_1.map and do a shift refresh to see if the numbers change to what is actually on the server.
No matter how many tags I put on the site to not cache anything, Chrome does it anyway sometimes.
I caught the issue. For some reason, while making the training matrix, I double tabbed the first column, so the first sample ended up containing 0 values. I think it is functional now. Quick question though: how is the intensity for each sample's individual unit calculated? Here are two screenshots. The first contains the intensity value for two samples. The second contains the TPM values for each respective sample and each gene represented in the unit. How are the relative intensities calculated? I would appreciate any clarification. See row 39, column 29!
It's a linear scale between the 10% (blue) and 90% (red) value. You can set any scale you'd like for any selected maps by using the "groups" tab. Give the group of selected maps a name and you can set the blue and red value.
The numbers themselves are the neuron coordinates in n-dimensions after training. They represent whatever scale you used in your training matrix (TPM, RPKM, whatever).
Also, if you used the Log2 option, the values are log2(x+1) corrected.
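To make the linear scale concrete, here is a minimal Python sketch of mapping a unit value onto a blue-to-red range. `linear_intensity` is a hypothetical name. Clamping values outside the range is an assumption, and how SOMatic itself derives the 10% and 90% values is not shown in this thread, so the blue and red values are passed in explicitly, as they would be when set by hand in the "groups" tab.

```python
def linear_intensity(value, blue, red):
    """Map `value` linearly onto [0, 1] between `blue` and `red`.

    Returns 0.0 at or below `blue` and 1.0 at or above `red`
    (the clamping is an assumption made for this sketch).
    """
    if red <= blue:
        raise ValueError("red must be greater than blue")
    fraction = (value - blue) / (red - blue)
    return min(1.0, max(0.0, fraction))
```

With blue set to 0 and red set to 10, a unit value of 5 lands halfway between the two colors. If the Log2 option was used, the values being scaled are the log2(x+1) corrected ones.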
Awesome! that makes sense. Thank you so much for your help!
No problem. Let me know if I can help further.
Hi Camden,
I am working on implementing SOMatic on a lot of different ChIP samples. The ability to create and link an RNA-seq SOM to the ChIP-SOM will really take this analysis to the next level. I am having trouble running the rsemToTrainingMatrix.sh script.
It says to input with this syntax: ./rsemToTrainingMatrix.sh ../Path/to/regex/rsem/files/*.rsem ../path/to/output/sample.list ../path/to/output/TrainingMatrix
Could you provide a clear working example of the expected inputs and directory structure? When I run the program with my RSEM files, it does nothing. If I mess up the regex for the RSEM files, it prints an error saying something like: cannot find file '.rsem'.
I would appreciate any help here!