Closed by cavei 1 year ago
Dear @cavei, thank you very much for using ExplorATE. The error occurs because the supplied RepeatMasker file uses a different chromosome nomenclature than the .gtf file for the genome. You can fix it with either of these two options:
1) Use this RepeatMasker file together with the reference genome and the .gtf file from the tutorial (recommended).
2) Alternatively, use the '-c' argument to supply a chromosome alias file: a tab-separated file whose first column gives the desired chromosome name (i.e. the name used in the .gtf file) and whose second column gives the name to replace in the RepeatMasker file. Any additional columns are ignored. This is the less recommended option because you must ensure that the file lists every chromosome name in the RepeatMasker file that needs replacing, and that each replacement exactly matches a name in the .gtf file.
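For option 2, here is a minimal Python sketch of how such an alias file could be assembled and sanity-checked before passing it to '-c'. The chromosome names and the helper names (write_alias_file, missing_aliases) are made up for illustration and are not part of ExplorATE. The only thing taken from the description above is the two-column, tab-separated layout (desired .gtf name first, RepeatMasker name second).

```python
import io


def write_alias_file(pairs, handle):
    """Write (gtf_name, repeatmasker_name) pairs as two tab-separated columns."""
    for gtf_name, rm_name in pairs:
        handle.write(f"{gtf_name}\t{rm_name}\n")


def missing_aliases(rm_names, gtf_names, pairs):
    """Return RepeatMasker names that would still not match any .gtf name."""
    mapping = {rm_name: gtf_name for gtf_name, rm_name in pairs}
    gtf_set = set(gtf_names)
    # A name is fine if its alias (or the name itself, when no alias is given)
    # appears among the .gtf chromosome names.
    return sorted(
        name for name in set(rm_names) if mapping.get(name, name) not in gtf_set
    )


# Hypothetical names: Ensembl-style .gtf vs. UCSC-style RepeatMasker output.
gtf_names = ["1", "2", "MT"]
rm_names = ["chr1", "chr2", "chrM"]
pairs = [("1", "chr1"), ("2", "chr2")]  # deliberately incomplete

buffer = io.StringIO()  # stands in for an open file such as open("alias.tsv", "w")
write_alias_file(pairs, buffer)
alias_text = buffer.getvalue()
unmapped = missing_aliases(rm_names, gtf_names, pairs)
```

In this toy case unmapped contains chrM, which is the kind of gap that makes option 2 easy to get wrong: every RepeatMasker name needs an alias that matches the .gtf exactly.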
Finally, I want to let you know that we are working to optimize the ExplorATE pipeline for model organisms. This feature will be moved to a separate program (TESSA) in the coming weeks. I will be finalizing the tests this week, and a preliminary version will be available in the course of next week. I am sure that this new version will be very useful for you, as it reduces execution time, makes data entry easier, and creates a more extensive index. I hope that you can test this new version with your data.
OK, I'll try with the recommended genome and annotation, and I'll wait for the TESSA release. Thanks
Hi, sorry to bother you, but I've run the pipeline with the recommended references and ended up with the same result.
This is the quantification:
Name            Length  EffectiveLength  TPM         NumReads
chr11150411675  171     37.000           0.000000    0.000
chr11167711780  103     2.000            20.496806   1.000
chr11526415355  91      2.000            0.000000    0.000
chr11890619048  142     20.000           186.638207  91.057
chr11997120405  434     309.542          25.402857   191.817
chr12053020679  149     24.000           0.000000    0.000
But what if I want to trace these IDs back to the genomic location and the type of repeat element?
In "reference.csv" I found only this:
chr1;L1MC5a/LINE/L1;
chr1;MER5B/DNA/hAT-Charlie;
chr1;MIR3/SINE/MIR;
chr1;L2a/LINE/L2;
chr1;L3/LINE/CR1;
chr1;Plat_L3/LINE/CR1;
chr1;MLT1K/LTR/ERVL-MaLR;
chr1;MIR/SINE/MIR;
Is this expected? Do you suggest that I wait for the TESSA implementation?
Thank you very much for these detailed reports. This result is not expected; an error was introduced during the last update. The bug is already fixed in the current version. Please update the files in the bin folder. I took the opportunity to incorporate some updates that now generate a more extensive reference and simplify some intermediate steps, avoiding excessive intermediate files. Although indexing may take a few more minutes, this extensive reference should reduce mapping ambiguity at the quantification stage. In the references.csv file (and in the first column of the quant.sf files) you should see the transcripts labeled as:
[chromosome]:[start]:[end]:[repName]/[className]/[repFamily]
For example, the first row in references.csv could be:
chr1:11504:11675:L1MC5a/LINE/L1;L1MC5a/LINE/L1
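To get from such a label back to the location and repeat type, something along these lines should work. This is only a sketch in Python: RepeatLabel and parse_label are illustrative names, not ExplorATE functions, and it assumes that chromosome names contain no ':' and that every annotation has all three parts (repName/className/repFamily).

```python
from typing import NamedTuple


class RepeatLabel(NamedTuple):
    chrom: str
    start: int
    end: int
    rep_name: str
    class_name: str
    rep_family: str


def parse_label(label: str) -> RepeatLabel:
    """Split '[chromosome]:[start]:[end]:[repName]/[className]/[repFamily]'."""
    chrom, start, end, annotation = label.split(":", 3)
    # Assumes three '/'-separated parts; raises ValueError otherwise.
    rep_name, class_name, rep_family = annotation.split("/", 2)
    return RepeatLabel(chrom, int(start), int(end), rep_name, class_name, rep_family)


# The first ';'-separated field of a references.csv row is the transcript label.
row = "chr1:11504:11675:L1MC5a/LINE/L1;L1MC5a/LINE/L1"
example = parse_label(row.split(";")[0])
```

The same function can be applied to the Name column of a quant.sf file, since that column carries the same labels.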
In addition, a references.bed file is created with the genomic coordinates of each mapped transcript.
I tested this update with the UCSC references. Please let me know whether you can run it successfully.
Hi, the new pipeline ran smoothly to the end with the new transposon nomenclature. Thanks
Dear ExplorATE developer,
I've run the ExplorATE shell script on my data. Everything runs fine, but I am not able to retrieve usable transposon IDs. I ran this command:
bash ${EXPLORATE} mo -p 12 \
  -b /usr/bin/bedtools \
  -s /usr/bin/salmon \
  -f ${genomefa} \
  -g ${gtf} \
  -r ${repmaskout} \
  -e pe -l ${fastq_dir} -o out_hs -v 'higher_score'
where
EXPLORATE=ExplorATE_shell_script/ExplorATE
genomefa=Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtf=Homo_sapiens.GRCh38.106.chr.gtf
fastq_dir=fastqs
repmaskout=hg38.fa.out.gz (from https://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz)
The pipeline runs without errors, but for each sample I get a salmon quantification with IDs like these:
Name         Length  EffectiveLength  TPM         NumReads
11148411676  192     51.000           0.000000    0.000
11167711780  103     2.000            145.084440  1.000
13129231754  462     298.000          0.000000    0.000
13284033037  197     57.572           0.000000    0.000
In the reference.csv file I get entries in this form:
1;L1MC5a/LINE/L1;
1;MER5B/DNA/hAT-Charlie;
1;MIR3/SINE/MIR;
1;Charlie15a/DNA/hAT-Charlie;
1;L2a/LINE/L2;
And these are the only two outputs.
Can you help me with this?
Thanks