JiekaiLab / scTE

MIT License

Custom reference for non-human, non-mouse genome #3

Closed hswhitbeck closed 3 years ago

hswhitbeck commented 3 years ago

Hi, the ReadMe file says "If you want to use your customs reference, you can use the -gene -te options:". We understood this as being able to use your code on other genomes than the mouse and the human. We tried this command to build the index:

scTE_build -te /path/to/hsal_v8.5_filtered_unique_ids.bed -gene /path/to/hsal_v8.5_genes_update16.gtf -o /path/to/scTE_build_1.idx

and we got the following error message:

scTE_build: error: the following arguments are required: -g/--genome

In the ReadMe file example the -g argument is not supplied for building a custom index. Why is it required? Any tips are appreciated. Thank you.

jphe commented 3 years ago

We have updated scTE to include more species' genomes, and -g is now optional if the bed/gtf files are given.

bsierieb1 commented 3 years ago

Thanks for your reply, @jphe. We have downloaded the updated version of scTE and now get another error:

ERROR : Counting genome other not supported

We work with an exotic non-model insect species. Could you please help us generate a custom index for our genome? Would it be possible to share the genome with you so that you could include it in the next update? If this is too much work for you, maybe you could guide us through the process and let us do it ourselves? Thanks a lot!

jphe commented 3 years ago

For non-model species, you need to make sure the genome has well-annotated files for both TEs and genes.

GitHub has a strict file size limit of 100 MB, and the genome indices are usually much bigger than that, so we cannot upload the genome indices to GitHub.

If you have a web-accessible FTP server or any other web-accessible tool, you can share the annotation files with us. We will then build the indices and send them to you.

bsierieb1 commented 3 years ago

here are the genome and the annotation files.

thank you so much for your help!

bsierieb1 commented 3 years ago

P.S. you should be able to use the same link to upload the indices. please let us know if there is any issue!

jphe commented 3 years ago

There is only the gtf file for genes on the ftp, while scTE also needs an annotation file for TEs.

The gene annotation gtf file seems to be derived from a transcript assembly. We do not recommend such files, because they contain many TE-derived transcripts. scTE assigns reads to genes/transcripts first and only then to TEs, so using such a file for quantification will lead to underestimation of TE expression.
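To make the effect of that ordering concrete, here is a toy illustration. It is only a sketch of the "genes first, then TEs" rule described above, not scTE's actual code: the function `assign_read`, the feature names, and the coordinates are all hypothetical, and reads are reduced to a single position on one chromosome.

```python
def assign_read(read_pos, genes, tes):
    """Assign a read to a gene if it overlaps one, otherwise to a TE.

    genes and tes are lists of (name, start, end) with half-open intervals.
    Returns ("gene", name), ("TE", name), or None if nothing overlaps.
    """
    # Genes/transcripts are checked first, mirroring the order described above.
    for name, start, end in genes:
        if start <= read_pos < end:
            return ("gene", name)
    # Only reads that hit no gene are considered for TEs.
    for name, start, end in tes:
        if start <= read_pos < end:
            return ("TE", name)
    return None


# Hypothetical annotations: one TE, one curated gene, one assembled transcript
# that happens to span the TE.
te_annotation = [("L1_example", 100, 200)]
curated_genes = [("geneA", 500, 900)]
assembled_genes = curated_genes + [("assembled_tx", 90, 210)]

print(assign_read(150, curated_genes, te_annotation))    # ('TE', 'L1_example')
print(assign_read(150, assembled_genes, te_annotation))  # ('gene', 'assembled_tx')
```

With the curated annotation the read at position 150 is counted for the TE. Once the assembled transcript covers the same region, the same read is counted for the transcript instead, and the TE loses it.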

Besides, transcript assembly usually depends heavily on bulk RNA-seq data, whereas development and disease processes are highly heterogeneous. Transcripts from rare cell types are often masked in bulk RNA-seq, so a transcript assembly built from bulk RNA-seq data may be unreliable for analyzing rare cell types in single-cell data.

If you want to use the assembled transcripts, maybe you can try the strategy of this paper, which quantifies the expression of TE-derived transcripts: https://genome.cshlp.org/content/early/2020/12/21/gr.265173.120.abstract

bsierieb1 commented 3 years ago

sorry, i accidentally copied a link to one of the files instead of the link to the entire drive folder. here is the correct link.

the gene annotations file is not derived from a transcriptome assembly, but i wonder what made you think that? the gene annotations were generated by the NCBI annotation pipeline and further updated by incorporating additional RNA-seq data. the TE annotations are simply the output of RepeatMasker (edited to remove some classes of short features).

bsierieb1 commented 3 years ago

hi @jphe, do you think you have everything you need? thank you for offering help!

jphe commented 3 years ago

Sorry for the late reply. We cannot interpret the file properly: we don't know what each column means, as it does not seem to be a standard gtf file. Basically, you need to convert the gene annotation into canonical gtf format. Alternatively, you can check whether Ensembl has a gtf file for the genome; gtf files from Ensembl should be in canonical format.
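For anyone unsure whether their gene annotation is in canonical form, a rough first check could look like the sketch below. It is not part of scTE, and the function name `check_gtf_line` is hypothetical. It only tests the general GTF layout (nine tab-separated columns, integer coordinates, a valid strand, and a `gene_id` attribute); whether scTE needs further attributes is not covered here.

```python
import re

# seqname, source, feature, start, end, score, strand, frame, attributes
GTF_COLUMNS = 9


def check_gtf_line(line):
    """Return a list of problems found in one GTF line (empty if none)."""
    if line.startswith("#") or not line.strip():
        return []  # comment and blank lines are skipped
    fields = line.rstrip("\n").split("\t")
    if len(fields) != GTF_COLUMNS:
        return [f"expected {GTF_COLUMNS} tab-separated columns, found {len(fields)}"]
    problems = []
    start, end, strand, attributes = fields[3], fields[4], fields[6], fields[8]
    if not (start.isdigit() and end.isdigit()):
        problems.append("start/end are not integers")
    elif int(start) > int(end):
        problems.append("start is greater than end")
    if strand not in {"+", "-", "."}:
        problems.append(f"unexpected strand {strand!r}")
    if not re.search(r'gene_id "[^"]+"', attributes):
        problems.append('attributes lack gene_id "..."')
    return problems
```

Running this over the first few hundred lines of a file and printing any non-empty result is usually enough to spot a BED-like or otherwise non-standard layout before trying to build an index.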

akui113 commented 3 years ago

@jphe I also encountered the same problem, and the species is Macaca mulatta. The gene annotation file was downloaded from http://ftp.ensembl.org/pub/release-104/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.104.gtf.gz, and the RepeatMasker file was downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/database/rmsk.txt.gz .

Then I processed the RepeatMasker file to get a six-column bed file with

awk 'BEGIN{FS=OFS="\t"}{print $6,$7,$8,$11,$3,$10}' rmsk.txt > mmul10rmsk.bed

and made sure the chromosome names were consistent with the gene annotation file. Lastly, I built the index with

scTE_build -te mmul10rmsk.bed -gene Macaca_mulatta.Mmul_10.104.gtf -o Mmul_10scTE.idx

However, I get the error:

ERROR : Counting genome other not supported
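For reference, the awk step above can also be written in Python, which makes the chromosome-name handling explicit. This is a minimal sketch that assumes the same UCSC rmsk.txt column order as the awk command (columns 6, 7, 8, 11, 3, 10). The function name `rmsk_to_bed` and the `strip_chr` option are hypothetical, and removing a leading "chr" is only an assumption about how the UCSC and Ensembl names differ; the actual names in both files should be compared before relying on it.

```python
def rmsk_to_bed(lines, strip_chr=False):
    """Yield six-column BED lines from UCSC rmsk.txt lines.

    Column choice mirrors: awk '{print $6,$7,$8,$11,$3,$10}'
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed or truncated lines
        chrom, start, end = fields[5], fields[6], fields[7]
        name, score, strand = fields[10], fields[2], fields[9]
        # Assumption: the two files differ only by a leading "chr".
        if strip_chr and chrom.startswith("chr"):
            chrom = chrom[3:]
        yield "\t".join([chrom, start, end, name, score, strand])
```

A typical use would be to open rmsk.txt, pass the file object to `rmsk_to_bed`, and write each yielded line followed by a newline to the bed file. This does not address the "Counting genome other not supported" error itself; it only covers the file preparation.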

Any tips are appreciated ! Thank you for your generous help!

antecede commented 1 year ago

Hello authors, and thank you for your beautiful work! Could you please write a guide to the process so that others can create their own custom references for non-model species? That way we could get our results in a timely manner while reducing your workload. Thanks again! Best wishes!

antecede commented 1 year ago

> Sorry for the late reply. We cannot interpret the file properly: we don't know what each column means, as it does not seem to be a standard gtf file. Basically, you need to convert the gene annotation into canonical gtf format. Alternatively, you can check whether Ensembl has a gtf file for the genome; gtf files from Ensembl should be in canonical format.

If the research is on a non-model species, there is no canonical gtf in Ensembl. If convenient, please provide an example of a non-Ensembl gtf, or explain how to fill in the missing column content, so that a custom reference can be built.

BR