bvaldebenitom / SoloTE

GNU General Public License v3.0
23 stars 6 forks source link

Can I use SoloTE to quantify human endogenous retrovirus in Single cell #4

Closed biohiran closed 1 year ago

biohiran commented 1 year ago

I have a fes query on the scope of using SoloTE for expression analysis of human ERVs in single cell data, keeping in mind that there are very few tools to use for this objective in the single cell space.

  1. Considering hERVs as a one class of Transposable elements, it does looks like could use SoloTE for this purpose. What is your thought on this ? is there anything that you think would affect the confidence of the results for hERVS?

  2. Transposable elements could be repeated at different locus/region in genome. That means biologicall same reads can be mapped to multiple loci. If we use "outSAMattributes' as i to bam, technically we are limiting to one potential alignment , but discarding other biologically possible mapping . is that correct?

bvaldebenitom commented 1 year ago

Hi @biohiran,

  1. You can use it for hERVs by utilizing as input only the annotation of those elements, or you can filter them from the matrix once you get the complete result.
  2. This is correct for those reads that map to multiple locations. In SoloTE we aim to report locus-specific expression of TEs when the reads align uniquely. If this is not the case, expression is summarized at the family level. Because of this, there is a trade off in allowing more alignments, which may artificially make the subfamily estimations larger.

Regarding the second point, we are working on applying the EM algorithm to further increase the TEs detected with locus resolution.

QuietgraceH commented 1 year ago

Hello. I am using the output results (barcode.tsv genes.tsv matrix.mtx) for single cell analysis. But there are more than 900,000 features in my gene.tsv file, is this normal?When I run this command, it takes a long time to output the results. This may be due to data redundancy. Will this phenomenon affect the clustering results? The command is listed: pbmc <-FindVariableFeatures(pbmc,selection.metthod="vst",nfeatures=2000) The pbmc is listed in R: image

Bests,

bvaldebenitom commented 1 year ago

Hi @QuietgraceH,

It looks like you are not getting any errors now! (related to Issue #8).

Regarding this question, by default the pipeline reports all TEs with at least 1 UMI. However, from practical experience keeping only those TEs with at least 3, and in some cases, with at least 10 UMIs, reduced considerably the number of TEs in the matrix.

Please let me know if you need any help regarding this pre-processing step, as it can be done in R or when using the pipeline. The first option is more flexible as it allows you to compare the impact of several UMI thresholds on the number of detected TEs.

QuietgraceH commented 1 year ago

Thanks. My previous problem has been solved. I deleted the irrelevant content of the bam format file and it can be run to get the result. Now there is a new problem. I want to use the results of the soloTE tool for cluster analysis, but the gene.tsv file has too many features, which is because of the addition of chromosome positions. Since I want to analyze in units of subfamily, regardless of chromosomal position. As in the example below, I hope to combine the two data into one data to see the expression level. So how to solve this problem. example: SoloTE|10|119523612|119524306|L1MB7:L1:LINE|- SoloTE|10|119524316|119525516|L1MB7:L1:LINE|-

bvaldebenitom commented 1 year ago

Hi @QuietgraceH,

you can generate a subfamily version of the results using the following command: awk 'BEGIN{FS=OFS="\t"}{if($1 ~ /SoloTE\|chr/){split($1,a,"|"); $1=a[1]"|"a[5]}; print $1,$2,$3 }' *_allcounts_final.txt > table_bysubf.txt where *_allcounts_final.txt should be replaced with the appropriate name of the final file generated in the SoloTE temp directory.

Then, use the attached script in the following way: Rscript create_subfamily_matrix_v2.R table_bysubf.txt

Upon completion, you will see a new directory named Matrix_Subfamily with the locus counts summarized by subfamily.

create_subfamily_matrix.tar.gz

HuangZY2 commented 1 year ago

Hi @QuietgraceH I had the same problem as you when I used soloTE:

grep: SRR8091142_soloTE_allcounts.txt: No such file or directory
grep: SRR8091142_soloTE_allcounts.txt: No such file or directory
grep: SRR8091142_soloTE_allcounts.txt: No such file or directory

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input
Calls: read.delim -> read.table
Execution halted
Creating final results directory
/home/huangzy/data_huangzy/single_cell_data/02_soloTE/SRR8091142_soloTE_SoloTE_output was created
Traceback (most recent call last):
  File "/mnt/data_huangzy/single_cell_data/02_soloTE/SoloTE_1.08/SoloTE_developmental_20230503.py", line 342, in <module>
    file_table = pd.read_table(input_te_file,header=None,sep="\t")
  File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1242, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
    return mapping[engine](f, **self.options)
  File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 555, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

I would like to know how you solved it. And what irrelevant content did you delete from the bam file?

bvaldebenitom commented 1 year ago

@HuangZY2 can you share the contents of the SRR8091142_soloTE_SoloTE_temp directory?

HuangZY2 commented 1 year ago

Hi @bvaldebenitom

I tried different versions of the software.

Below are the contents of the SRR8091142_soloTE_SoloTE_temp folder, which results from soloTE v1.06:

total 124M
drwxrwxr-x 2 huangzy huangzy 4.0K May 17 07:37 .
drwxrwxr-x 5 huangzy huangzy 4.0K May 17 07:37 ..
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_allcounts_final.txt
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_final.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:37 SRR8091142_soloTE_final.bam.bai
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_full.bam
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_full_sorted.bam
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_genes_2.txt
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:37 SRR8091142_soloTE_genes.bam
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_genes.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_locustes_2.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_locustes.txt
-rw-rw-r-- 1 huangzy huangzy  40M May 17 07:37 SRR8091142_soloTE_nogenes.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:37 SRR8091142_soloTE_nogenes_newtag.bam
-rw-rw-r-- 1 huangzy huangzy  40M May 17 07:37 SRR8091142_soloTE_nogenes_oldtag.bam
-rw-rw-r-- 1 huangzy huangzy 9.4M May 17 07:37 SRR8091142_soloTE_nogenes_overlappingtes.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:37 SRR8091142_soloTE_nogenes_overlappingtes.bam.bai
-rw-rw-r-- 1 huangzy huangzy  21M May 17 07:37 SRR8091142_soloTE_nogenes_overlappingtes.bed
-rw-rw-r-- 1 huangzy huangzy 987K May 17 07:37 SRR8091142_soloTE_selectedtes.bed
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_subftes_2.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:37 SRR8091142_soloTE_subftes.txt
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_teannotated.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:37 temp.bam

And following are the contents of the SRR8091142_soloTE_SoloTE_temp directory, which are from soloTE v1.08. Using this version does not result in the *output folder:

total 124M
drwxrwxr-x 2 huangzy huangzy 4.0K May 17 07:49 .
drwxrwxr-x 4 huangzy huangzy 4.0K May 17 07:49 ..
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_allcounts_final.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_barcodes.tsv
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr10.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr11.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr12.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr13.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr14.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr15.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr16.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr17.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr18.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr19.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr1.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr1_GL456210_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr1_GL456212_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr2.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr3.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr4.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr4_JH584293_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr5.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr6.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr7.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr8.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chr9.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chrUn_GL456367.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chrUn_GL456378.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chrUn_JH584304.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chrX.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chrX_GL456233_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_countpercell_chrY.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_final.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:49 SRR8091142_soloTE_final.bam.bai
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_full.bam
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_full_sorted.bam
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_genes_2.txt
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:49 SRR8091142_soloTE_genes.bam
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_genes.tsv
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_genes.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_locustes_2.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_locustes.txt
-rw-rw-r-- 1 huangzy huangzy  40M May 17 07:49 SRR8091142_soloTE_nogenes.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:49 SRR8091142_soloTE_nogenes_newtag.bam
-rw-rw-r-- 1 huangzy huangzy  40M May 17 07:49 SRR8091142_soloTE_nogenes_oldtag.bam
-rw-rw-r-- 1 huangzy huangzy 9.4M May 17 07:49 SRR8091142_soloTE_nogenes_overlappingtes.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:49 SRR8091142_soloTE_nogenes_overlappingtes.bam.bai
-rw-rw-r-- 1 huangzy huangzy  21M May 17 07:49 SRR8091142_soloTE_nogenes_overlappingtes.bed
-rw-rw-r-- 1 huangzy huangzy 987K May 17 07:49 SRR8091142_soloTE_selectedtes.bed
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_subftes_2.txt
-rw-rw-r-- 1 huangzy huangzy    0 May 17 07:49 SRR8091142_soloTE_subftes.txt
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_teannotated.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:49 temp.bam

Finally here are the BAM file and BED file I used: BAM file:

SRR8091142.610352       163     chr1    3667663 255     76M     =       3667801 193     ATATAATACATGCAAATTTTGTTCACAATGTTTCAAGTGGCTTATTTCAGAACAAACAAACACATGGATGTTATTT    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE<EEEE6EEA        NH:i:1  HI:i:1  AS:i:119        nM:i:5
SRR8091142.610352       83      chr1    3667801 255     55M21S  =       3667663 -193    TCCTGCCCATGCAATTAAAGGTTAGCAACCCCTTTTTAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATAC    E/E//A</E//66E///E<///E///A//EA//6AAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA        NH:i:1  HI:i:1  AS:i:119        nM:i:5
SRR8091142.415819       163     chr1    4746419 255     10M61466N66M    =       4807975 82201   GTACATGGGGACTGTCCGCCAGCCGGTGGATGTGCGGCAACAACATGTCCGCTCCGATGCCCGCCGTTGTGCCGGC    AAAAAEEEEEEEEEEEEAEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEAEEEEEEEEEEE        NH:i:1  HI:i:1  AS:i:144        nM:i:0
SRR8091142.461128       163     chr1    4746419 3       10M111265N66M   =       4857799 111456  GTACATGGGGACTTCTACTTTCCAGTCTCCTGCGATCGATTCGTAGTGCCTGTGTGGCGCGCGCGGTGCTTGACTG    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NH:i:2  HI:i:1  AS:i:140        nM:i:0
SRR8091142.430655       163     chr1    4758108 0       76M     =       4758149 116     GATGGCTCAGTGGGTAAGAGCACCCGACTGCTCTTCCGAAGGTCCGAAGTTCAAATCCCAGCAACCACATGGTGGC    AAAAAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NH:i:37 HI:i:1  AS:i:149        nM:i:0

BED file:

chr1    3000001 3000097 chr1|3000001|3000097|L1MdFanc_I:L1:LINE|-       14737   -
chr1    3000124 3002128 chr1|3000124|3002128|L1MdFanc_I:L1:LINE|-       14737   -
chr1    3003148 3004054 chr1|3003148|3004054|L1MdFanc_I:L1:LINE|-       2853    -
chr1    3004041 3004206 chr1|3004041|3004206|L1_Rod:L1:LINE|+   3936    +
chr1    3004271 3005001 chr1|3004271|3005001|L1_Rod:L1:LINE|+   3936    +

Thanks for your help!

bvaldebenitom commented 1 year ago

@HuangZY2 can you run the following command and share the output here?

samtools view -@ 8 -c -e 'exists([CB]) && exists([UB])' BAM

HuangZY2 commented 1 year ago

@bvaldebenitom It outputs 0 when I run the above command. Maybe this data was generated from SMART-Seq2 protocol.

samtools view -@ 8 -c -e 'exists([CB]) && exists([UB])' SRR8091142_map_data_pairedAligned.sortedByCoord.out.bam
0
bvaldebenitom commented 1 year ago

@HuangZY2 you can re-check the STAR command used to generate the alignment, but if it was indeed generate with SMART-Seq2, it is not going to work. We currently don't have support for it in SoloTE.

QuietgraceH commented 1 year ago

Hi @QuietgraceH,

you can generate a subfamily version of the results using the following command: awk 'BEGIN{FS=OFS="\t"}{if($1 ~ /SoloTE\|chr/){split($1,a,"|"); $1=a[1]"|"a[5]}; print $1,$2,$3 }' *_allcounts_final.txt > table_bysubf.txt where *_allcounts_final.txt should be replaced with the appropriate name of the final file generated in the SoloTE temp directory.

Then, use the attached script in the following way: Rscript create_subfamily_matrix_v2.R table_bysubf.txt

Upon completion, you will see a new directory named Matrix_Subfamily with the locus counts summarized by subfamily.

create_subfamily_matrix.tar.gz

Thank you so much. This method has been successfully run on my computer, and I think this function is still necessary.