Closed biohiran closed 1 year ago
Hi @biohiran,
Regarding the second point, we are working on applying the EM algorithm to further increase the TEs detected with locus resolution.
Hello. I am using the output results (barcode.tsv genes.tsv matrix.mtx) for single cell analysis. But there are more than 900,000 features in my gene.tsv file, is this normal?When I run this command, it takes a long time to output the results. This may be due to data redundancy. Will this phenomenon affect the clustering results?
The command is listed:
pbmc <-FindVariableFeatures(pbmc,selection.metthod="vst",nfeatures=2000)
The pbmc is listed in R:
Bests,
Hi @QuietgraceH,
It looks like you are not getting any errors now! (related to Issue #8).
Regarding this question, by default the pipeline reports all TEs with at least 1 UMI. However, from practical experience keeping only those TEs with at least 3, and in some cases, with at least 10 UMIs, reduced considerably the number of TEs in the matrix.
Please let me know if you need any help regarding this pre-processing step, as it can be done in R or when using the pipeline. The first option is more flexible as it allows you to compare the impact of several UMI thresholds on the number of detected TEs.
Thanks. My previous problem has been solved. I deleted the irrelevant content of the bam format file and it can be run to get the result. Now there is a new problem. I want to use the results of the soloTE tool for cluster analysis, but the gene.tsv file has too many features, which is because of the addition of chromosome positions. Since I want to analyze in units of subfamily, regardless of chromosomal position. As in the example below, I hope to combine the two data into one data to see the expression level. So how to solve this problem. example: SoloTE|10|119523612|119524306|L1MB7:L1:LINE|- SoloTE|10|119524316|119525516|L1MB7:L1:LINE|-
Hi @QuietgraceH,
you can generate a subfamily version of the results using the following command:
awk 'BEGIN{FS=OFS="\t"}{if($1 ~ /SoloTE\|chr/){split($1,a,"|"); $1=a[1]"|"a[5]}; print $1,$2,$3 }' *_allcounts_final.txt > table_bysubf.txt
where *_allcounts_final.txt
should be replaced with the appropriate name of the final file generated in the SoloTE temp directory.
Then, use the attached script in the following way:
Rscript create_subfamily_matrix_v2.R table_bysubf.txt
Upon completion, you will see a new directory named Matrix_Subfamily
with the locus counts summarized by subfamily.
Hi @QuietgraceH I had the same problem as you when I used soloTE:
grep: SRR8091142_soloTE_allcounts.txt: No such file or directory
grep: SRR8091142_soloTE_allcounts.txt: No such file or directory
grep: SRR8091142_soloTE_allcounts.txt: No such file or directory
*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: read.delim -> read.table
Execution halted
Creating final results directory
/home/huangzy/data_huangzy/single_cell_data/02_soloTE/SRR8091142_soloTE_SoloTE_output was created
Traceback (most recent call last):
File "/mnt/data_huangzy/single_cell_data/02_soloTE/SoloTE_1.08/SoloTE_developmental_20230503.py", line 342, in <module>
file_table = pd.read_table(input_te_file,header=None,sep="\t")
File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1242, in read_table
return _read(filepath_or_buffer, kwds)
File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 577, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
self._engine = self._make_engine(f, self.engine)
File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
return mapping[engine](f, **self.options)
File "/home/huangzy/miniconda3/envs/te_annot/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 555, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
I would like to know how you solved it. And what irrelevant content did you delete from the bam file?
@HuangZY2 can you share the contents of the SRR8091142_soloTE_SoloTE_temp
directory?
Hi @bvaldebenitom
I tried different versions of the software.
Below are the contents of the SRR8091142_soloTE_SoloTE_temp
folder, which results from soloTE v1.06
:
total 124M
drwxrwxr-x 2 huangzy huangzy 4.0K May 17 07:37 .
drwxrwxr-x 5 huangzy huangzy 4.0K May 17 07:37 ..
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_allcounts_final.txt
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_final.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:37 SRR8091142_soloTE_final.bam.bai
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_full.bam
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_full_sorted.bam
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_genes_2.txt
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:37 SRR8091142_soloTE_genes.bam
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_genes.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_locustes_2.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_locustes.txt
-rw-rw-r-- 1 huangzy huangzy 40M May 17 07:37 SRR8091142_soloTE_nogenes.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:37 SRR8091142_soloTE_nogenes_newtag.bam
-rw-rw-r-- 1 huangzy huangzy 40M May 17 07:37 SRR8091142_soloTE_nogenes_oldtag.bam
-rw-rw-r-- 1 huangzy huangzy 9.4M May 17 07:37 SRR8091142_soloTE_nogenes_overlappingtes.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:37 SRR8091142_soloTE_nogenes_overlappingtes.bam.bai
-rw-rw-r-- 1 huangzy huangzy 21M May 17 07:37 SRR8091142_soloTE_nogenes_overlappingtes.bed
-rw-rw-r-- 1 huangzy huangzy 987K May 17 07:37 SRR8091142_soloTE_selectedtes.bed
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_subftes_2.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:37 SRR8091142_soloTE_subftes.txt
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:37 SRR8091142_soloTE_teannotated.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:37 temp.bam
And following are the contents of the SRR8091142_soloTE_SoloTE_temp
directory, which are from soloTE v1.08
. Using this version does not result in the *output folder:
total 124M
drwxrwxr-x 2 huangzy huangzy 4.0K May 17 07:49 .
drwxrwxr-x 4 huangzy huangzy 4.0K May 17 07:49 ..
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_allcounts_final.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_barcodes.tsv
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr10.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr11.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr12.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr13.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr14.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr15.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr16.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr17.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr18.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr19.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr1.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr1_GL456210_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr1_GL456212_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr2.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr3.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr4.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr4_JH584293_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr5.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr6.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr7.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr8.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chr9.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chrUn_GL456367.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chrUn_GL456378.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chrUn_JH584304.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chrX.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chrX_GL456233_random.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_countpercell_chrY.counts.tmp
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_final.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:49 SRR8091142_soloTE_final.bam.bai
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_full.bam
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_full_sorted.bam
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_genes_2.txt
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:49 SRR8091142_soloTE_genes.bam
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_genes.tsv
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_genes.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_locustes_2.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_locustes.txt
-rw-rw-r-- 1 huangzy huangzy 40M May 17 07:49 SRR8091142_soloTE_nogenes.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:49 SRR8091142_soloTE_nogenes_newtag.bam
-rw-rw-r-- 1 huangzy huangzy 40M May 17 07:49 SRR8091142_soloTE_nogenes_oldtag.bam
-rw-rw-r-- 1 huangzy huangzy 9.4M May 17 07:49 SRR8091142_soloTE_nogenes_overlappingtes.bam
-rw-rw-r-- 1 huangzy huangzy 1.5M May 17 07:49 SRR8091142_soloTE_nogenes_overlappingtes.bam.bai
-rw-rw-r-- 1 huangzy huangzy 21M May 17 07:49 SRR8091142_soloTE_nogenes_overlappingtes.bed
-rw-rw-r-- 1 huangzy huangzy 987K May 17 07:49 SRR8091142_soloTE_selectedtes.bed
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_subftes_2.txt
-rw-rw-r-- 1 huangzy huangzy 0 May 17 07:49 SRR8091142_soloTE_subftes.txt
-rw-rw-r-- 1 huangzy huangzy 2.9M May 17 07:49 SRR8091142_soloTE_teannotated.bam
-rw-rw-r-- 1 huangzy huangzy 1.7K May 17 07:49 temp.bam
Finally here are the BAM file and BED file I used: BAM file:
SRR8091142.610352 163 chr1 3667663 255 76M = 3667801 193 ATATAATACATGCAAATTTTGTTCACAATGTTTCAAGTGGCTTATTTCAGAACAAACAAACACATGGATGTTATTT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE<EEEE6EEA NH:i:1 HI:i:1 AS:i:119 nM:i:5
SRR8091142.610352 83 chr1 3667801 255 55M21S = 3667663 -193 TCCTGCCCATGCAATTAAAGGTTAGCAACCCCTTTTTAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATAC E/E//A</E//66E///E<///E///A//EA//6AAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NH:i:1 HI:i:1 AS:i:119 nM:i:5
SRR8091142.415819 163 chr1 4746419 255 10M61466N66M = 4807975 82201 GTACATGGGGACTGTCCGCCAGCCGGTGGATGTGCGGCAACAACATGTCCGCTCCGATGCCCGCCGTTGTGCCGGC AAAAAEEEEEEEEEEEEAEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEAEEEEEEEEEEE NH:i:1 HI:i:1 AS:i:144 nM:i:0
SRR8091142.461128 163 chr1 4746419 3 10M111265N66M = 4857799 111456 GTACATGGGGACTTCTACTTTCCAGTCTCCTGCGATCGATTCGTAGTGCCTGTGTGGCGCGCGCGGTGCTTGACTG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:2 HI:i:1 AS:i:140 nM:i:0
SRR8091142.430655 163 chr1 4758108 0 76M = 4758149 116 GATGGCTCAGTGGGTAAGAGCACCCGACTGCTCTTCCGAAGGTCCGAAGTTCAAATCCCAGCAACCACATGGTGGC AAAAAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:37 HI:i:1 AS:i:149 nM:i:0
BED file:
chr1 3000001 3000097 chr1|3000001|3000097|L1MdFanc_I:L1:LINE|- 14737 -
chr1 3000124 3002128 chr1|3000124|3002128|L1MdFanc_I:L1:LINE|- 14737 -
chr1 3003148 3004054 chr1|3003148|3004054|L1MdFanc_I:L1:LINE|- 2853 -
chr1 3004041 3004206 chr1|3004041|3004206|L1_Rod:L1:LINE|+ 3936 +
chr1 3004271 3005001 chr1|3004271|3005001|L1_Rod:L1:LINE|+ 3936 +
Thanks for your help!
@HuangZY2 can you run the following command and share the output here?
samtools view -@ 8 -c -e 'exists([CB]) && exists([UB])' BAM
@bvaldebenitom It outputs
0
when I run the above command. Maybe this data was generated from SMART-Seq2 protocol.
samtools view -@ 8 -c -e 'exists([CB]) && exists([UB])' SRR8091142_map_data_pairedAligned.sortedByCoord.out.bam
0
@HuangZY2 you can re-check the STAR command used to generate the alignment, but if it was indeed generate with SMART-Seq2, it is not going to work. We currently don't have support for it in SoloTE.
Hi @QuietgraceH,
you can generate a subfamily version of the results using the following command:
awk 'BEGIN{FS=OFS="\t"}{if($1 ~ /SoloTE\|chr/){split($1,a,"|"); $1=a[1]"|"a[5]}; print $1,$2,$3 }' *_allcounts_final.txt > table_bysubf.txt
where*_allcounts_final.txt
should be replaced with the appropriate name of the final file generated in the SoloTE temp directory.Then, use the attached script in the following way:
Rscript create_subfamily_matrix_v2.R table_bysubf.txt
Upon completion, you will see a new directory named
Matrix_Subfamily
with the locus counts summarized by subfamily.
Thank you so much. This method has been successfully run on my computer, and I think this function is still necessary.
I have a fes query on the scope of using SoloTE for expression analysis of human ERVs in single cell data, keeping in mind that there are very few tools to use for this objective in the single cell space.
Considering hERVs as a one class of Transposable elements, it does looks like could use SoloTE for this purpose. What is your thought on this ? is there anything that you think would affect the confidence of the results for hERVS?
Transposable elements could be repeated at different locus/region in genome. That means biologicall same reads can be mapped to multiple loci. If we use "outSAMattributes' as i to bam, technically we are limiting to one potential alignment , but discarding other biologically possible mapping . is that correct?