galaxyproject / training-material

A collection of Galaxy-related training material
https://training.galaxyproject.org
MIT License

Access to "REFERENCE_refSeq_genes_33k_TE_126.fa" in srna tutorial.md #525

Closed drosofff closed 1 year ago

drosofff commented 6 years ago

@malloryfreeberg Hi Mallory,

It's not clear from https://github.com/galaxyproject/training-material/blame/50518e9db648d8ac77ea7ab34f9caf595a78621b/topics/transcriptomics/tutorials/srna/tutorial.md#L252 how we can access the dataset that contains the fasta sequences of the 33k mRNAs and TEs (it is not in Zenodo). This is an issue because it remains uncertain whether this file contains canonical TE consensus sequences or instead a collection of real TE sequences (complete or incomplete); this in turn affects the downstream statistical analysis. Best, Christophe

malloryfreeberg commented 6 years ago

Hi @drosofff,

The full reference fasta file of 33k mRNA and TEs is not needed to complete the tutorial; the dm3_transcriptome_sequences_downsampled.fa.gz file in the zenodo entry is sufficient.

To clear up the uncertainty, the reference list of TEs is a set of canonical transposable element sequences. Specifically, I used the reference list of 126 transposable elements published as part of the piPipes suite of tools for analyzing piRNAs and transposons. The authors of these tools are experts in small RNA and transposon studies. The specific link to the dm3 reference files is here, and the file named dm3.transposon.fa is the one from which the TE sequences in the tutorial are taken. I believe these canonical transposon sequences originally come from the manually curated set of TE sequences (found here) from the Drosophila Genome Project.

The full mRNA reference list is simply the RefSeq entries, which can be imported directly into Galaxy via the UCSC Table Browser, or separately downloaded from another source and imported into Galaxy. Again, for the purposes of the tutorial, the required reference mRNA sequences are provided in zenodo.
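For readers who want to check for themselves what a reference fasta file contains (for example, how many records it holds and how long each one is), a minimal Python sketch follows. It is an illustration only and not part of the tutorial; the record names in the toy data are hypothetical, and the parser assumes a well-formed fasta file.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def open_text(path: str):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def summarize(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (sequence name, length) for every record, in file order."""
    return [
        (header.split()[0] if header else "", len(seq))
        for header, seq in read_fasta(lines)
    ]


# Toy records standing in for a real reference (hypothetical names).
example = """>TE_example_1 toy transposon
ACGTACGTAC
GTAC
>mRNA_example_1 toy transcript
TTTTGGGG
""".splitlines()

for name, length in summarize(example):
    print(name, length)
```

To run this against the tutorial file, one could pass an open handle instead of the toy lines, e.g. `with open_text("dm3_transcriptome_sequences_downsampled.fa.gz") as fh: summarize(fh)`, assuming the file has been downloaded to the working directory.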

I hope this description clears things up. Do let me know if you have any additional questions.

drosofff commented 6 years ago

@malloryfreeberg Thanks Mallory, it's all clear 👍. Good to see these links, which remind me of interesting times! Why not state in the tutorial that the reference used for siRNA profiling is dm3_transcriptome_sequences_downsampled.fa.gz? The illustration is misleading, right?

drosofff commented 6 years ago

@malloryfreeberg it's me again. Sorry to bother you with my questions. In the tutorial, you suggest that profiling genomic items that contain repeated sequences is complicated when the whole genome is used as the reference for alignment. All right. But the reference that you are using does in fact contain both canonical transposons and several piRNA clusters identified by Brennecke. The latter are composed of intermingled copies of transposons whose sequences will undoubtedly match the sequences of your canonical transposons.

Thus, in a way, your reference also contains repeated sequences, and I was wondering how salmon is influenced by this feature: typically, your read counts should differ depending on which clusters you included in the reference together with the canonical transposons (for instance, the Flam cluster is enriched in zam and idefix, but 42AB is not). Do you see what I mean?
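The concern above can be made concrete with a small sketch that measures how many k-mers two reference sequences have in common. This is only an illustration in Python with made-up toy sequences and hypothetical names; it does not reproduce how salmon indexes or assigns reads. It only shows that a cluster embedding a transposon copy shares all of that transposon's k-mers, so reads from those regions are ambiguous between the two entries.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fractions(
    reference: Dict[str, str], k: int
) -> Dict[Tuple[str, str], float]:
    """For each pair of sequences, report the shared k-mers as a
    fraction of the smaller of the two k-mer sets."""
    sets = {name: kmers(seq, k) for name, seq in reference.items()}
    result = {}
    for a, b in combinations(sorted(sets), 2):
        smaller = min(len(sets[a]), len(sets[b]))
        shared = len(sets[a] & sets[b])
        result[(a, b)] = shared / smaller if smaller else 0.0
    return result


# Toy reference (hypothetical sequences): the "cluster" embeds a full
# copy of the canonical TE between unrelated flanks.
canonical_te = "ACGTTGCAAGCT"
toy_reference = {
    "canonical_TE": canonical_te,
    "cluster_with_TE_copy": "GGGG" + canonical_te + "CCCC",
    "unrelated_mRNA": "TTTTTAAAAATTTTT",
}

fractions = shared_kmer_fractions(toy_reference, k=5)
for pair, fraction in fractions.items():
    print(pair, round(fraction, 2))
```

In this toy case every k-mer of the canonical TE also occurs in the cluster entry, while the unrelated mRNA shares none. Running the same comparison on a real reference with a realistic k would show which cluster entries overlap which canonical transposons.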

hexylena commented 4 years ago

@mvdbeek Bérénice mentioned you have a new version of this tutorial that might address this issue?

hexylena commented 1 year ago

Closing because the original issue has been sufficiently resolved; the additional questions may be out of scope.