Transposon names / nomenclature

cavei commented 2 years ago

Dear ExplorATE developer,

I've run the ExplorATE shell script on my data. Everything runs fine, but I am not able to retrieve usable transposon IDs. This is the command I ran:

bash ${EXPLORATE} mo -p 12 \
  -b /usr/bin/bedtools \
  -s /usr/bin/salmon \
  -f ${genomefa} \
  -g ${gtf} \
  -r ${repmaskout} \
  -e pe -l ${fastq_dir} -o out_hs -v 'higher_score'

where

EXPLORATE=ExplorATE_shell_script/ExplorATE
genomefa=Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtf=Homo_sapiens.GRCh38.106.chr.gtf
fastq_dir=fastqs
repmaskout=hg38.fa.out.gz (from https://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz)

The pipeline proceeds well, but for each sample I get back a salmon quantification with IDs like these:

Name         Length  EffectiveLength  TPM         NumReads
11148411676  192     51.000           0.000000    0.000
11167711780  103     2.000            145.084440  1.000
13129231754  462     298.000          0.000000    0.000
13284033037  197     57.572           0.000000    0.000

In the reference.csv file I've got

1;L1MC5a/LINE/L1;
1;MER5B/DNA/hAT-Charlie;
1;MIR3/SINE/MIR;
1;Charlie15a/DNA/hAT-Charlie;
1;L2a/LINE/L2;

in this form.

And these are the only two outputs.

Can you help me with this?

Thanks

FemeniasM commented 2 years ago

Dear @cavei, thank you very much for using ExplorATE. The error occurs because the supplied RepeatMasker file uses a different chromosome nomenclature than the .gtf file for the genome. You can fix it with either of these two options:

1) Use this RepeatMasker file together with the reference genome and the .gtf file from the tutorial (recommended).

2) Alternatively, use the '-c' argument to supply a chromosome alias file: a tab-separated file whose first column gives the desired chromosome name (i.e. the name used in the .gtf file) and whose second column gives the name to replace in the RepeatMasker file. Any additional columns are ignored. This is the less recommended option because you must make sure that the file lists every chromosome name in the RepeatMasker file and that each replacement exactly matches a name in the .gtf file.
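For option 2, it can help to check an alias file before passing it to '-c'. The following is only a sketch, not part of ExplorATE: the function names are hypothetical, and it assumes a standard GTF (sequence name in the first tab-separated column) and a standard RepeatMasker .out file (query sequence in the fifth whitespace-separated field, data rows starting with an integer score). The example rows are illustrative.

```python
import csv


def gtf_seqnames(lines):
    """Collect sequence names (column 1) from GTF lines, skipping '#' comments."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def repeatmasker_seqnames(lines):
    """Collect query sequence names (5th field) from RepeatMasker .out lines."""
    names = set()
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; the header rows do not.
        if len(fields) >= 5 and fields[0].isdigit():
            names.add(fields[4])
    return names


def read_alias(lines):
    """Map RepeatMasker name (column 2) -> desired .gtf name (column 1)."""
    alias = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 2:  # extra columns are ignored
            alias[row[1]] = row[0]
    return alias


def check_alias(alias, gtf_names, rm_names):
    """Return (RepeatMasker names without an alias, alias targets not in the GTF)."""
    missing = sorted(n for n in rm_names if n not in alias)
    unknown = sorted({t for t in alias.values() if t not in gtf_names})
    return missing, unknown


# Illustrative input: Ensembl-style "1" in the GTF, UCSC-style "chr1" in RepeatMasker.
gtf_lines = ['#!genome-build GRCh38',
             '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "example";']
rm_lines = ['   SW   perc perc perc  query     position in query',
            '  463   1.3  0.6  1.7  chr1  10001  10468 (248945954) +  (TAACCC)n  Simple_repeat  1  471  (0)  1']
alias = read_alias(["1\tchr1"])

missing, unknown = check_alias(alias, gtf_seqnames(gtf_lines), repeatmasker_seqnames(rm_lines))
print("RepeatMasker names without alias:", missing)
print("Alias targets missing from GTF:", unknown)
```

Both lists should be empty before running the pipeline. With real files, the lists of lines would be replaced by open file handles (gzip.open in text mode for a .gz file).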

Finally, I want to let you know that we are working to optimize the ExplorATE pipeline for model organisms. This feature will be moved to a separate program (TESSA) in the coming weeks. I will finish the tests this week, and a preliminary version will be available over the course of next week. I am sure this new version will be very useful for you, as it reduces execution time, makes data entry easier, and creates a more extensive index. I hope you can test it with your data.

cavei commented 1 year ago

Ok, I'll try with the recommended genome and annotation, and I'll wait for the TESSA release. Thanks

cavei commented 1 year ago

Hi, sorry to bother you, but I've run the pipeline with the recommended references and I've ended up with the same result.

This is the quantification

Name            Length  EffectiveLength  TPM         NumReads
chr11150411675  171     37.000           0.000000    0.000
chr11167711780  103     2.000            20.496806   1.000
chr11526415355  91      2.000            0.000000    0.000
chr11890619048  142     20.000           186.638207  91.057
chr11997120405  434     309.542          25.402857   191.817
chr12053020679  149     24.000           0.000000    0.000

But what if I want to get back to the genomic location and the type of repeated element?

In "reference.csv" I found only this

chr1;L1MC5a/LINE/L1;
chr1;MER5B/DNA/hAT-Charlie;
chr1;MIR3/SINE/MIR;
chr1;L2a/LINE/L2;
chr1;L3/LINE/CR1;
chr1;Plat_L3/LINE/CR1;
chr1;MLT1K/LTR/ERVL-MaLR;
chr1;MIR/SINE/MIR;

Is this what you expect? Do you suggest I wait for the TESSA implementation?

FemeniasM commented 1 year ago

Thank you very much for these detailed reports. This result is not expected; an error was introduced during the last update. The bug is already fixed in the current version. Please update the files in the bin folder.

I took the opportunity to incorporate some updates that now generate a more extensive reference and simplify some intermediate steps, avoiding the creation of excessive intermediate files. Although indexing may take a few more minutes, this extensive reference should reduce mapping ambiguity at the quantification stage.

In the references.csv file (and in the first column of the quant.sf files) you should see the transcripts labeled as:

[chromosome]:[start]:[end]:[repName]/[className]/[repFamily]

For example, the first row in references.csv could be:

chr1:11504:11675:L1MC5a/LINE/L1;L1MC5a/LINE/L1

In addition, a references.bed file is created with the genomic coordinates of each mapped transcript. I tested this update with the UCSC references; please let me know if you can run it successfully.
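With IDs in this format, the location and repeat type can be recovered directly from the Name column of quant.sf. The sketch below is not part of ExplorATE; the names RepeatId, parse_repeat_id and annotate_quant are hypothetical. It assumes the ID layout described above and the usual tab-separated quant.sf header (Name, Length, EffectiveLength, TPM, NumReads). As a further assumption, annotations without a family part are padded with an empty string. The example row reuses the numbers from the earlier quant output, rewritten with the new-style ID for illustration.

```python
import csv
from typing import NamedTuple


class RepeatId(NamedTuple):
    chrom: str
    start: int
    end: int
    rep_name: str
    rep_class: str
    rep_family: str


def parse_repeat_id(name):
    """Split '[chromosome]:[start]:[end]:[repName]/[className]/[repFamily]'."""
    parts = name.split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected ID format: {name!r}")
    chrom, start, end, annotation = parts
    # Assumption: pad with '' when the class or family part is absent.
    rep_name, rep_class, rep_family = (annotation.split("/", 2) + ["", ""])[:3]
    return RepeatId(chrom, int(start), int(end), rep_name, rep_class, rep_family)


def annotate_quant(lines):
    """Yield (RepeatId, TPM, NumReads) for quant.sf rows whose Name parses."""
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            rep = parse_repeat_id(row["Name"])
        except ValueError:
            continue  # not a repeat-style ID, so skip the row
        yield rep, float(row["TPM"]), float(row["NumReads"])


# Illustrative quant.sf content (values from the thread, new-style ID).
quant_lines = [
    "Name\tLength\tEffectiveLength\tTPM\tNumReads",
    "chr1:11504:11675:L1MC5a/LINE/L1\t171\t37.000\t0.000000\t0.000",
]
for rep, tpm, num_reads in annotate_quant(quant_lines):
    print(rep.chrom, rep.start, rep.end, rep.rep_class, rep.rep_family, tpm, num_reads)
```

For a real run, quant_lines would be an open handle on a sample's quant.sf. The references.bed file mentioned above already holds the coordinates, so parsing the ID is mainly a convenience when working from quant.sf alone.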

cavei commented 1 year ago

Hi, the new pipeline ran smoothly to the end with the new transposon nomenclature. Thanks

FemeniasM / ExplorATE_shell_script
