HCGB-IGTP / XICRA

Small RNAseq pipeline for paired-end reads
MIT License
7 stars 3 forks source link

understanding the miRNA output #10

Closed mars188 closed 3 years ago

mars188 commented 3 years ago

Hello,

I analysed the small RNA data using XICRA and it worked. So, I have now output file that's a little bit confusing for me. Here is how it looks like:

ID sample1 sample2 sample3 .... hsa-let-7a-3p&NA&iso-21-4L2MUYWZD 21 5 13 hsa-let-7a-3p&iso_3p:+1&iso-22-4L2MUYWZJ 55 64 101 hsa-let-7a-3p&iso_3p:+4,iso_add3p:2,iso_5p:+3,iso_snv_central&iso-22-UOW5X7Y7M 10 4 22 hsa-let-7a-3p&iso_3p:+4,iso_snv_central,iso_5p:+3,iso_add3p:2&iso-22-UOW5X7Y7N 34 46 40 hsa-let-7a-3p&iso_3p:+4,iso_snv_seed,iso_add3p:2,iso_5p:+3&iso-22-UXW597Y7N 35 3 15

So, main identified small RNA is "hsa-let-7a-3p"? and other part of the name are isoforms of "hsa-let-7a-3p"? and numbers are the read counts? How do we interpret these results?

Many thanks in advance!

JFsanchezherrero commented 3 years ago

Hi there,

The output of the miRNA module is a count table where ID is a unique entry with counts for each sample in columns. You can input this table into any differential expression analysis package (e.g. DESeq2, limma, etc).

Here is your example provided:

ID sample1 sample2 sample3
hsa-let-7a-3p&NA&iso-21-4L2MUYWZD 21 5 13
hsa-let-7a-3p&iso_3p:+1&iso-22-4L2MUYWZJ 55 64 101
hsa-let-7a-3p&iso_3p:+4,iso_add3p:2,iso_5p:+3,iso_snv_central&iso-22-UOW5X7Y7M 10 4 22
hsa-let-7a-3p&iso_3p:+4,iso_snv_central,iso_5p:+3,iso_add3p:2&iso-22-UOW5X7Y7N 34 46 40
hsa-let-7a-3p&iso_3p:+4,iso_snv_seed,iso_add3p:2,iso_5p:+3&iso-22-UXW597Y7N 35 3 15

XICRA generates this table containing information at the isoform miRNA (isomiR) level (variant modifications of the canonical miRNAs as described in miRBase) and the ID column, contains all available information for each entry (separated by '&').

For example, take the ID = _hsa-let-7a-3p&iso_3p:+1&iso-22-4L2MUYWZJ,_ it contains information of the miRNA (hsa-let-7a-3p), the modification from the original sequence in mirBase (iso_3p:+1) and the Unique Identifier (UID, unique sequence plate as generated by miRTOP). See here:

miRNA variant type UID
hsa-let-7a-3p iso_3p:+1 iso-22-4L2MUYWZJ

Variant annotation is generated categorizing isomiRs into classes based on their sequence modifications: iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_suppl, following miRTOP suggested classification scheme. NA variants correspond to canonical isomiRs. See additional information on each variant in the miRTOP original publication https://pubmed.ncbi.nlm.nih.gov/31504201/.

In the example you provided here for hsa-let-7a-3p miRNA, the most abundant isoform is not the cannonical but the modified isomiR iso_3p+1.

XICRA.stats is an R package that allows you to use the matrix table generated either at the isomiR, miRNA canonical or a variant type.

See details here: https://github.com/HCGB-IGTP/XICRA.stats You basically install it using devtools and using as follows:

# Install XICRA.stats version from GitHub:
# install.packages("devtools")
devtools::install_github("HCGB-IGTP/XICRA.stats")

data_XICRA <- XICRA.stats::parse_XICRA(path_to_matrix_file)

data_XICRA is a list containing dataframes with the data group at different levels: isomiR, miRNA canonical or variant.

See an example using XICRA.stats for the data analysis we use in the original publication of XICRA.stats here: https://github.com/HCGB-IGTP/XICRA/blob/master/BMC_bioinformatics_paper/code/GSE114923_DESeq2_analysis.R

Best regards

mars188 commented 3 years ago

Thank you soooo much for the nice explanation. Now, I understand how output looks like. Just have two small (may be) question:

  1. The numbers 21, 5, 13 are the counts, representing the abundance of that miRNA in sample1, 2 and 3, right?
  2. What if I want to know only about miRNA and don't care about the isoform? Can I use the first part like "hsa-let-7a-3p" alog with its readcount (abundance) for differential expression or any other downstream analysis? Or that would be wrong and this protocol is specifically for knowing about the isoform?
JFsanchezherrero commented 3 years ago

Hi there,

You are welcome! I hope you find it useful.

I would answer your questions.

  1. You are right, numbers (e.g. 21, 5 and 13) are raw counts for each isomiR and each sample. In this case, using the software you specify in the miRNA module (e.g. miraligner), counts were obtained for each isomiR and sample, format was standarized using miRTop.

  2. Yes, definitely. This is what I meant when I recommended to use XICRA.stats to group counts for the same miRNA or the same variant.

Basically do: data_XICRA <- XICRA.stats::parse_XICRA(path_to_matrix_file)

And as a results, data_XICRAis a list containing dataframes (isomir_data, miRNA_dataand variant_data) with the data grouped at different levels: isomiR, miRNA canonical or variant type.

You can also use packages such as tidyr or dplyrto separate IDs (by &) and group_by counts at the miRNA level or the desired level.

Following the second question, you can just use the data at the miRNA level but take into account some new discoveries regarding the robustness and better discrimiation of results using isomiRs vs miRNA https://pubmed.ncbi.nlm.nih.gov/31882820/ https://pubmed.ncbi.nlm.nih.gov/28206648/

On the other hand, take into account potential problems regarding validation and amplification of single isomiRs as non template variants might be present and PCR primers might produce low performance.

I would recommend to make the analysis at both levels, at the miRNA and the isomiR level and check for consistencies of results and putative candidates.

Best regards!

mars188 commented 3 years ago

Thank you so much. You are very knowledgeable and been extremely useful in my analysis.

Ok, I will conduct XCIRA.stat analysis and update you if I need any further help.

Thank youuuu again for all your time!

JFsanchezherrero commented 3 years ago

Hi there, You are welcome. I am glad I could help you in your analysis.

Do not doubt to contact me for further details if necessary. I am going to close the issue so far, please reopen it or a new one if you require so.

Best regards Jose F. Sanchez

mars188 commented 3 years ago

This may be an R question but you may be able to help me with this too:

I tried the following command in R:

data_XICRA <- XICRA.stats::parse_XICRA(path_to_matrix_file) It worked. When I run "data_XICRA" I can see the rows and columns in R studio.

However, I tried to export/save the data with write.csv or write.table as below: write.csv(data_XICRA, "miRNA_results.csv") It gives me the following error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 183767, 1866, 102686

And it doesn't write the output file. Any idea how to fix this?

mars188 commented 3 years ago

I google searched this issue but couldn't find the solution :(

JFsanchezherrero commented 3 years ago

Hi there,

The thing is the R object data_XICRAis a list of 3 dataframes:

data_XICRA$isomir_data
data_XICRA$mirna_data
data_XICRA$variant_data

that is why it generates the error when you try to save in a file.

Basically, the differences among the 3 dataframes are:

Try to find the differences by doing:

head(data_XICRA$isomir_data)
head(data_XICRA$mirna_data)
head(data_XICRA$variant_data)

Regards, Jose

mars188 commented 3 years ago

I just completed the analysis and obtained nice and clean miRNA as well as variant data files. Thank you sooooo much. You have been so helpful. Also, the pipeline is very useful too.

Best regards,