bvaldebenitom / SoloTE

GNU General Public License v3.0

Interpreting the results table #18

Closed yeroslaviz closed 1 year ago

yeroslaviz commented 1 year ago

Hi Braulio,

Thanks again for the help with the analysis and for taking the time. I have now run the pipeline all the way to the end, and in the final object I do get some SoloTE items in my list of significant genes:

E.g.

> markers_cellranger_sig_te
                                                              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster
SoloTE-chr4-1271971-1272134-TART-A:Jockey:LINE--       2.231053e-23  0.3385400 0.134 0.028  5.033925e-19       0
SoloTE-Gypsy1-I-DM:Gypsy:LTR                           2.291321e-13  0.2580726 0.629 0.453  5.169908e-09       0
SoloTE-Gypsy1-I-DM:Gypsy:LTR1                          6.605104e-12  0.2773599 0.596 0.458  1.490310e-07       1
SoloTE-DM1731-LTR:Copia:LTR                           3.868694e-132  0.8135736 0.832 0.260 8.728934e-128       2

I was wondering whether you could explain how to interpret the results. I have a few questions as well as a suggestion:

  1. In the first line, for example, I have TART-A:Jockey:LINE. What does it mean? How do I know which gene/transposon is represented here?
  2. How do I interpret the naming of the results (as also stated in the BED file)? The first part, e.g. TART-A, is the element masked on the genome. The second part, e.g. Jockey:LINE, is the family.
  3. Is there a way to see how many reads from each of the TEs were found/mapped? Can I see them in my count matrix? Do I understand correctly that the counts for each "SoloTE-*" entry do not only come from the genomic position of this specific gene/element, but also (or only) from its identified hits at other places in the genome?
  4. Is there a way to figure out where the hits I have are coming from?
  5. How does the tool analyse the input files I gave it? I used the fastA file from RepeatMasker (RM), converted to a BED file. The RM file contains many more entries than I can see in the BED file. Can you please explain which rows/entries are kept and which are discarded? Do I only keep those entries belonging to TEs and their families?

The names are too long, and searching for them in the input BED file is not possible because the | symbol was changed. Is there a way to make that easier?

Thanks again for the help.

Assa

bvaldebenitom commented 1 year ago

Hi @yeroslaviz,

1. In the first line, for example, I have TART-A:Jockey:LINE. What does it mean? How do I know which gene/transposon is represented here?

The TE identifier is built in this manner: Subfamily (or Repeat Instance):Family:Class

So for that example, you would have a TART-A element of the Jockey family that belongs to the LINE Class of TEs.
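As a rough illustration of that naming scheme, the sketch below splits two identifiers from the marker table into Subfamily, Family and Class with awk. It is only a sketch: the file names `te_ids.txt` and `te_parts.tsv` are made up, and the patterns assume that a locus-specific name has the form `SoloTE-<chrom>-<start>-<end>-<TE>` and that trailing hyphens after the class are not part of the class name.

```shell
# Example identifiers copied from the marker table in this thread.
printf '%s\n' \
  'SoloTE-chr4-1271971-1272134-TART-A:Jockey:LINE--' \
  'SoloTE-DM1731-LTR:Copia:LTR' > te_ids.txt

# Split each identifier on ":" into Subfamily, Family and Class.
awk -F':' 'BEGIN { OFS = "\t" } {
    sub_name = $1
    sub(/^SoloTE-/, "", sub_name)               # drop the SoloTE prefix
    sub(/^[^-]+-[0-9]+-[0-9]+-/, "", sub_name)  # drop chrom-start-end, if present
    cls = $3
    sub(/-+$/, "", cls)                         # assumption: trailing hyphens are not part of the class
    print sub_name, $2, cls
}' te_ids.txt > te_parts.tsv
cat te_parts.tsv
```

For these two inputs the output has one tab-separated row per identifier: TART-A, Jockey, LINE and DM1731-LTR, Copia, LTR.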

2. How do I interpret the naming of the results (as also stated in the BED file)? The first part, e.g. TART-A, is the element masked on the genome. The second part, e.g. Jockey:LINE, is the family.

Please see the above reply.

3. Is there a way to see how many reads from each of the TEs were found/mapped? Can I see them in my count matrix? Do I understand correctly that the counts for each "SoloTE-*" entry do not only come from the genomic position of this specific gene/element, but also (or only) from its identified hits at other places in the genome?

Taking the particular TE indicated above: when an identifier has "chr" in its name, it corresponds to a specific copy of that element in the genome. If you don't see chromosome coordinates in the identifier, the reads could map to many copies of that TE, and the counts are aggregated at the subfamily level.

Using the count matrix, you could use something like

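# count_matrix is a placeholder for your count matrix (rows are genes/TEs)
te_matrix_indices <- grep("TART-A:Jockey:LINE", rownames(count_matrix), fixed = TRUE)
# drop = FALSE keeps the matrix shape even if only one row matches
count_matrix[te_matrix_indices, , drop = FALSE]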

This should give the entries in the matrix corresponding to that same TE, and you could check whether other entries in the matrix are at the locus-specific level or not.

4. Is there a way to figure out, where the hits I have are coming from?

As described above, if you see chromosome coordinates, the reads come from that genomic location. If not, they could not be accurately mapped to a single genomic location and are summarized at the subfamily level.
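To make that distinction concrete, here is a minimal sketch that pulls all entries for one TE out of a list of feature names and labels each as locus-specific or subfamily-level. The file `features.txt` and the entry `SoloTE-TART-A:Jockey:LINE` are hypothetical; in practice the list would be the row names of your count matrix. The pattern assumes locus-specific names look like `SoloTE-<chrom>-<start>-<end>-...`, so it does not depend on the chromosome name starting with "chr".

```shell
# Hypothetical feature list; the second entry is made up for illustration.
printf '%s\n' \
  'SoloTE-chr4-1271971-1272134-TART-A:Jockey:LINE--' \
  'SoloTE-TART-A:Jockey:LINE' \
  'SoloTE-Gypsy1-I-DM:Gypsy:LTR' > features.txt

# Keep the entries for one TE and label the level of each.
grep -F 'TART-A:Jockey:LINE' features.txt |
awk 'BEGIN { OFS = "\t" } {
    # Coordinates right after the prefix mark a locus-specific entry.
    level = ($0 ~ /^SoloTE-[^-]+-[0-9]+-[0-9]+-/) ? "locus" : "subfamily"
    print level, $0
}' > tart_a_levels.tsv
cat tart_a_levels.tsv
```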

5. How does the tool analyse the input files I gave it? I used the fastA file from RepeatMasker (RM), converted to a BED file. The RM file contains many more entries than I can see in the BED file. Can you please explain which rows/entries are kept and which are discarded? Do I only keep those entries belonging to TEs and their families?

If you use the conversion script, it will only keep bona fide TEs of the classes DNA, LTR, LINE and SINE. The RepeatMasker file also contains non-TE elements, such as simple repeats and satellites.
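The filtering idea can be sketched with awk as follows. This is not the actual conversion script, only an illustration of keeping the four TE classes. It assumes the standard whitespace-delimited RepeatMasker .out layout, where column 11 holds class/family; the file names, the example rows and the name layout in the fourth BED column are all made up, and the BED names that the real script writes may look different.

```shell
# Tiny made-up RepeatMasker .out excerpt: three header lines, then one
# LINE/Jockey hit and one Simple_repeat hit (values are illustrative only).
cat > rm_example.out <<'EOF'
   SW   perc perc perc  query     position in query     matching  repeat         position in repeat
score   div. del. ins.  sequence  begin   end   (left)  repeat    class/family   begin  end (left)  ID

  789   10.1  0.0  0.0  chr4  1271972 1272134 (76000) C TART-A LINE/Jockey   (0)  163    1   1
   21    0.0  0.0  0.0  chr4     2000    2040 (99000) + (AT)n  Simple_repeat   1   41  (0)   2
EOF

# Keep data rows (first field numeric) whose class is DNA, LTR, LINE or SINE.
# BED starts are 0-based, so 1 is subtracted from the .out begin coordinate.
awk 'BEGIN { OFS = "\t" }
$1 ~ /^[0-9]+$/ {
    n = split($11, cf, "/")          # cf[1] = class, cf[2] = family (if given)
    fam = (n > 1) ? cf[2] : cf[1]
    if (cf[1] ~ /^(DNA|LTR|LINE|SINE)$/)
        print $5, $6 - 1, $7, $10 ":" fam ":" cf[1]   # hypothetical name layout
}' rm_example.out > te_only.bed
cat te_only.bed
```

Only the TART-A row survives here; the Simple_repeat row is dropped, which is why the BED file has fewer entries than the RepeatMasker file.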

Hope this helps.

EDIT: Regarding the last point, you can use awk like this to generate a new BED file of the marker TEs:

awk 'BEGIN{FS="-";OFS="\t"}{print $2,$3,$4,$0}' te_markers > te_markers.bed

This will result in a BED file that you can use for downstream analysis.
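One caveat: subfamily-level markers such as SoloTE-Gypsy1-I-DM:Gypsy:LTR carry no coordinates, so splitting them on "-" yields fields that are not a valid BED interval. A possible variant, sketched below under the assumption that `te_markers` is a plain text file with one marker identifier per line, keeps only lines whose third and fourth fields are integers. The example file content is taken from the marker table in this thread.

```shell
# Assumed input: one marker identifier per line.
printf '%s\n' \
  'SoloTE-chr4-1271971-1272134-TART-A:Jockey:LINE--' \
  'SoloTE-Gypsy1-I-DM:Gypsy:LTR' > te_markers

# Same field layout as the command above, restricted to lines where
# the start and end fields are integers (locus-specific entries).
awk 'BEGIN { FS = "-"; OFS = "\t" }
     $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ { print $2, $3, $4, $0 }' te_markers > te_markers.bed
cat te_markers.bed
```

With this input, only the TART-A copy on chr4 ends up in te_markers.bed; the Gypsy1-I-DM entry is skipped because it is summarized at the subfamily level.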

yeroslaviz commented 1 year ago

Sorry for the late response, as I was on vacation 🛫, but I want to say a big thank you for taking the time and for the very detailed answers to my questions. They really did help.

I will now close this, but if I have more questions I will re-open this ticket.

Again, thanks a lot

Assa

EDIT: PS. Regarding your awk command: do I understand correctly that the first te_markers file is the output from RepeatMasker?