rsemToTrainingMatrix - Githubissues

amcruise commented 3 years ago

Hi Camden,

I am working on implementing SOMatic on a lot of different ChIP samples. The ability to create and link an RNA-seq SOM to the ChIP-SOM will really take this analysis to the next level. I am having trouble implementing the rsemToTrainingMatrix.sh script.

It says to input with this syntax: ./rsemToTrainingMatrix.sh ../Path/to/regex/rsem/files/*.rsem ../path/to/output/sample.list ../path/to/output/TrainingMatrix

Could you provide a clear working example of the expected inputs and directory structure? When I run the program with my RSEM files, it does nothing. If I mess up the regex for the RSEM files, it prints an error saying something like: cannot find file '.rsem'.

I would appreciate any help here!

csjansen commented 3 years ago

Hi,

The matrix building script uses "ls -1v $1" to gather all of the rsem files. Can you confirm that ls -1v ../Path/to/regex/rsem/files/*.rsem writes out the proper filenames?

-Camden

csjansen commented 3 years ago

Could you also confirm that the rsem output you are linking to is properly tab delimited and that the TPM values are in the 6th column?
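
The two checks asked for above (tab delimiting, TPM in the 6th column) can be sketched in a few lines of Python. This is only an illustrative check, not part of SOMatic; the function name `check_rsem_header` and its `tpm_column` parameter are made up for this example.

```python
import io


def check_rsem_header(handle, tpm_column=6):
    """Return (ok, message) for the header row of an RSEM results file.

    Hypothetical helper: assumes the first line is a header and that the
    TPM column is literally named "TPM".
    """
    header = handle.readline().rstrip("\n")
    fields = header.split("\t")
    if len(fields) < tpm_column:
        # Too few fields usually means the file is not tab delimited.
        return False, f"only {len(fields)} tab-separated field(s) found"
    name = fields[tpm_column - 1]
    if name != "TPM":
        return False, f"column {tpm_column} is {name!r}, expected 'TPM'"
    return True, "ok"


# Example with an in-memory header; for a real file, pass open(path).
good = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
print(check_rsem_header(io.StringIO(good)))
```

A header that was pasted with spaces instead of tabs would fail the first check, since it splits into a single field.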

amcruise commented 3 years ago

Thank you for your quick reply Camden, I will check the RSEM outputs and get back to you!

amcruise commented 3 years ago

I checked: ls -1v path/to/rsem

And it correctly lists the file names.

This is the format of the RSEM files:

gene_id transcript_id(s) length effective_length expected_count TPM FPKM
ENSG00000000003.10 ENST00000373020.4_TSPAN6-001,ENST00000494424.1_TSPAN6-002,ENST00000496771.1_TSPAN6-003 2206.00 2136.39 345.00 16.21 11.58

The script is able to locate them and populate a sample list but it does not output a training matrix. I am unsure why. :(

amcruise commented 3 years ago

I tried to create my own training matrix:

ENSG00000230043.1 1.81 0 12.695 39.135 28.835 12.035 6.14 0
ENSG00000252200.1 0 7.8 0 32.205 6.555 0 0 0
ENSG00000182916.7 33.925 8.86 14.81 37.96 0.14 0 0.09 0.045
ENSG00000105472.8 4.425 9.365 9.78 21.505 0.025 0 0.02 0
ENSG00000133636.6 1.475 0.6 0.105 88.55 0.33 0.35 0.285 0.325
ENSG00000216077.1 0 0 0 18.69 0 0 0 0
ENSG00000169908.6 28.84 17.975 20.79 51.575 0.115 0.155 0.105 0.19
ENSG00000265455.1 0 10.41 6.035 17.295 4.75 5.505 1.975 0
ENSG00000145040.3 1 4.045 0.915 14.88 0.13 1.15 1.59 0
ENSG00000165092.8 132.235 56.93 60.52 43.305 0.39 2.25 1.89 0.21
ENSG00000266241.1 0 2.815 0 13.185 0 0 0 0
ENSG00000091664.7 5.98 1.6 4.405 69.92 1.335 0.195 0.97 0.435

While it was able to train the SOM and build the site, it seems that it is not amenable to downstream metaclustering analysis.

csjansen commented 3 years ago

I have been running into a strange problem recently in which the unit code fails for some unknown reason and the unit files become empty. Can you check the files in SOMName/data/som/units to see if they are all 0 size? "ls -l". If so, I'll work on a fix today.

amcruise commented 3 years ago

They are actually not zero sized! So just to double check, the training matrix is composed of rows of genes and columns of TPMs for each sample?

csjansen commented 3 years ago

Yes, it looks like the matrix you made is of the right format. It is tab delimited, yes? What part of the metaclustering code is failing?
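
For anyone assembling the matrix by hand, the layout confirmed here (one row per gene, one tab-delimited TPM column per sample) can be sketched as below. This is a rough stand-in, not the logic of rsemToTrainingMatrix.sh; `read_tpm` and `build_matrix` are hypothetical names, and the sketch assumes every sample file has a header row and lists the same genes.

```python
def read_tpm(lines, tpm_column=6):
    """Map gene_id -> TPM (as text) from the lines of one RSEM results file."""
    tpm = {}
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i == 0 or not line:
            continue  # skip the header row and any blank lines
        fields = line.split("\t")
        tpm[fields[0]] = fields[tpm_column - 1]
    return tpm


def build_matrix(samples):
    """Build tab-delimited rows: gene_id, then one TPM per sample.

    samples: list of gene->TPM dicts, in the same order as the sample list.
    Gene order follows the first sample (an assumption of this sketch).
    """
    rows = []
    for gene in samples[0]:
        rows.append("\t".join([gene] + [s[gene] for s in samples]))
    return rows


# Tiny made-up inputs, only to show the shape of the result.
header = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM"
sample_a = [header, "geneA\ttxA\t100\t90\t5\t1.5\t1.0", "geneB\ttxB\t200\t190\t0\t0\t0"]
sample_b = [header, "geneA\ttxA\t100\t90\t9\t2.5\t2.0", "geneB\ttxB\t200\t190\t3\t0.7\t0.5"]
for row in build_matrix([read_tpm(sample_a), read_tpm(sample_b)]):
    print(row)
```

A gene missing from any later sample raises a `KeyError`, which is deliberate here: a silently shortened row would shift every column after it.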

amcruise commented 3 years ago

Ok awesome! Yes, it is tab delimited. I am trying to retrace my steps from earlier; I lost track of what exactly I did. It is currently stopped at "pushing threads".

amcruise commented 3 years ago

On second thought, this is where it crashed earlier, but it seems to be working now:

Cluster Num: 5
Pushing threads
70724.5
Cluster Num: 6
Pushing threads

Will keep you posted!

amcruise commented 3 years ago

Hey Camden,

Are there any issues that you know could cause this corrupted SOM? Would appreciate any clarification here. Thanks!

csjansen commented 3 years ago

This could have been caused by a full 0 vector in the training matrix, or by 1 sample that is driving all the differences. I'd highly recommend log correcting your RNA data: log2(x+1).
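
Both suggestions are easy to try on a matrix before training. The sketch below is standalone Python, not SOMatic's own Log2 option; `log2_correct` and `all_zero_rows` are hypothetical helper names.

```python
import math


def log2_correct(values):
    """Apply log2(x + 1) to each value, so zeros stay at zero."""
    return [math.log2(v + 1.0) for v in values]


def all_zero_rows(matrix):
    """Return the gene ids whose values are 0 in every sample.

    matrix: list of (gene_id, [values...]) pairs.
    """
    return [gene for gene, values in matrix if all(v == 0 for v in values)]


# Two rows taken from the matrix posted earlier in this thread,
# plus one made-up all-zero row to show what gets flagged.
matrix = [
    ("ENSG00000230043.1", [1.81, 0, 12.695, 39.135, 28.835, 12.035, 6.14, 0]),
    ("ENSG00000216077.1", [0, 0, 0, 18.69, 0, 0, 0, 0]),
    ("made_up_zero_gene", [0, 0, 0, 0, 0, 0, 0, 0]),
]
print(all_zero_rows(matrix))
corrected = [(gene, log2_correct(values)) for gene, values in matrix]
```

Adding 1 before taking the log keeps zero TPMs defined and compresses the range, so a single very highly expressed sample has less pull on the map.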

amcruise commented 3 years ago

So I tried using the log2 option. It seems to work for all except one sample, which returns 0 on all of its units. We ran the samples in duplicate, so there is another SOM to show what we might expect. Is there any reason that this may be happening to this specific sample? It is the first sample in the sample list. I am not sure why it does not work. The training matrix also contains values for all replicates, so in theory there should at least be something there.

csjansen commented 3 years ago

Ya, I'm not quite sure why that could be happening to the first data set. It would make sense if it were the last and you didn't have the right number of columns in your training matrix (aka a missing one). Could you check to make sure that a GFP_1.map file exists in the data/som directory? Does it have real data in it? It also could be possible that the website has been lodged into your browser cache if you are using the same path as the previous site. You may need to go to WebsitePath/data/som/GFP_1.map and do a shift refresh and see if the numbers change to what is actually on the server.

No matter how many tags I put on the site to not cache anything, Chrome does it anyway sometimes.

amcruise commented 3 years ago

I caught the issue. For some reason, while making the training matrix, I double tabbed the first column, so the first sample contained 0 values. I think it is functional now. Quick question though: how is the intensity for each sample's individual unit calculated? Here are two screenshots. The first contains the intensity value for two samples. The second contains the TPM values for each respective sample and each gene represented in the unit. How are the relative intensities calculated? I would appreciate any clarification. See row 39, column 29!
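
A doubled tab like this shows up as an empty field and an extra column, so it can be caught before training. The following is a small sketch of such a check; `find_empty_fields` is a hypothetical helper, and `expected_columns` is the gene id column plus one column per sample.

```python
def find_empty_fields(lines, expected_columns):
    """Report rows with the wrong number of fields or with empty fields.

    Returns a list of (line_number, description) tuples; empty if all is well.
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected_columns:
            problems.append(
                (lineno, f"{len(fields)} fields, expected {expected_columns}")
            )
        empties = [i + 1 for i, f in enumerate(fields) if f.strip() == ""]
        if empties:
            problems.append((lineno, f"empty field(s) at column(s) {empties}"))
    return problems


# The first row has a doubled tab after the gene id; the second is fine.
rows = ["geneA\t\t1.0\t2.0", "geneB\t3.0\t4.0"]
for lineno, message in find_empty_fields(rows, expected_columns=3):
    print(lineno, message)
```

Run against a real matrix by passing the open file handle as `lines`.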

csjansen commented 3 years ago

It's a linear scale between the 10% (blue) and 90% (red) values. You can set any scale you'd like for any selected maps by using the "groups" tab. Give the group of selected maps a name and you can set the blue and red values.
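
As a rough illustration of a linear scale between a blue value and a red value, the sketch below maps a unit value to a position between 0 (blue) and 1 (red). It is not SOMatic's rendering code: `scale_position` is a hypothetical name, how the default blue and red values are derived is not shown here, and clamping values outside the range is an assumption of this sketch.

```python
def scale_position(value, blue, red):
    """Linearly map value to [0, 1], where blue -> 0.0 and red -> 1.0.

    Values outside [blue, red] are clamped to the ends (an assumption).
    """
    if red == blue:
        return 0.5  # degenerate scale: no range to interpolate over
    t = (value - blue) / (red - blue)
    return max(0.0, min(1.0, t))


# With blue = 0 and red = 10, a unit value of 5 sits halfway along the scale.
print(scale_position(5.0, 0.0, 10.0))
```

On a scale like this, two samples with different blue and red values can show the same color for different underlying numbers, which is why setting a shared scale through a group makes maps directly comparable.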

csjansen commented 3 years ago

The numbers themselves are the neuron coordinates in n dimensions after training. They represent whatever scale you used in your training matrix (TPM, RPKM, whatever).

csjansen commented 3 years ago

Also, if you used the Log2 option, the values are log2(x+1) corrected.

amcruise commented 3 years ago

Awesome! that makes sense. Thank you so much for your help!

csjansen commented 3 years ago

No problem. Let me know if I can help further.

csjansen / SOMatic

rsemToTrainingMatrix #7