JiekaiLab / scTE

MIT License

Output matrix with a series of numbers in gene/TE names #83

Closed (synnimeng closed 1 week ago)

synnimeng commented 2 months ago

Thanks for scTE! I got a strange output like this when running with a self-built index: a series of numbers appears before the correct gene/TE names.

[image: screenshot of the output matrix]

I built my index using:

scTE_build -te $te -gene $gene -o GRCh38.p14

Then I ran scTE:

scTE -p 20 -i Aligned.sortedByCoord.out.bam -o scte -x GRCh38.p14.exclusive.idx -CB CR -UMI UR

There was no error, and the log file shows:

Sample = scte
Reference annotation index = GRCh38.p14.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 20

INFO : Loading the genome annotation index... 2024-03-21 19:05:41
INFO : Loaded 'GRCh38.p14.exclusive.idx' binary file with 5983613 items
INFO : Finished loading the genome annotation index... 2024-03-21 19:06:24

INFO : Processing BAM/SAM files ...2024-03-21 19:06:24
INFO : Input SAM/BAM file appears to be valid
INFO : Done BAM/SAM files processing ...2024-03-21 19:21:12

INFO : Splitting ...2024-03-21 19:21:12
INFO : Executing multiple thread path with 20 threads
['0', '1', '10', '100', '1000', '1001', '1002', '1003', '1004', '1005', '1006', '1007', '1008', '1009', '101', '1010', '1011', '1012', '1013', '1014', '1015', '1016', '1017', '1018', '1019', '102', '1020', '1021', '1022', '1023', '1024', '1025', '1026', '1027', '1028', '1029', '103', '1030', '1031', '1032', '1033', '1034', '1035', '1036', '1037', '1038', '1039', '104', '1040', '1041', '1042', '1043', '1044', '1045', '1046', '1047', '1048', '1049', '105', '1050', '1051', '1052', '1053', '1054', '1055', '1056', '1057', '1058', '1059', '106', '1060', '1061', '1062', '1063', '1064', '1065', '1066', '1067', '1068', '1069', '107', '1070', '1071', '1072', '1073', '1074', '1075', '1076', '1077', '1078', '1079', '108', '1080', '1081', '1082', '1083', '1084', '1085', '1086', '1087', '1088', '1089', '109', '1090', '1091', '1092', '1093', '1094', '1095', '1096', '1097', '1098', '1099', '11', '110', '1100', '1101', '1102', '1103', '1104', '1105', '1106', '1107', '1108', '1109', '111', '1110', '1111', '1112', '1113', '1114', '1115', '1116', '1117', '1118', '1119', '112', '1120', '1121', '1122', '1123', '1124', '1125', '1126', '1127', '1128', '1129', '113', '1130', '1131', '1132', '1133', '1134', '1135', '1136', '1137', '1138', '1139', '114', '1140', ...'99', '990', '991', '992', '993', '994', '995', '996', '997', '998', '999', 'X', 'Y']
CR UR good

How should I deal with this? Did I make a mistake when building the index?

jphe commented 2 months ago

The -te option requires a six-column BED file for the transposable element annotation. You need to convert the /rmsk.txt.gz file to BED format before running scTE_build.
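For anyone hitting the same problem, here is a minimal sketch of such a conversion in Python. It assumes the input is the standard 17-column UCSC rmsk table, whose first column is a bin index rather than the chromosome. The function names are hypothetical, and the choice of repName for the BED name column and swScore for the score column is an assumption, not something confirmed in this thread; check the scTE documentation for what scTE_build actually expects in those fields.

```python
import gzip

# 0-based column positions in the UCSC rmsk table (17 columns in total).
# Column 0 is "bin", so the chromosome is NOT the first field.
SW_SCORE, GENO_NAME, GENO_START, GENO_END, STRAND, REP_NAME = 1, 5, 6, 7, 9, 10


def rmsk_row_to_bed6(fields):
    """Map one rmsk row (list of strings) to a six-column BED row."""
    return [
        fields[GENO_NAME],
        fields[GENO_START],  # UCSC table starts are already 0-based, like BED
        fields[GENO_END],
        fields[REP_NAME],    # assumption: TE name goes in the BED name column
        fields[SW_SCORE],    # assumption: swScore as score (may exceed BED's 0-1000 range)
        fields[STRAND],
    ]


def convert_rmsk_to_bed6(rmsk_gz_path, bed_path):
    """Stream a gzipped rmsk table into a plain-text BED6 file.

    Returns the number of rows written.
    """
    written = 0
    with gzip.open(rmsk_gz_path, "rt") as src, open(bed_path, "w") as dst:
        for line in src:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment/header lines
            fields = line.rstrip("\n").split("\t")
            dst.write("\t".join(rmsk_row_to_bed6(fields)) + "\n")
            written += 1
    return written
```

Calling `convert_rmsk_to_bed6("rmsk.txt.gz", "te.bed")` would then give a file that can be passed to -te in place of the raw table.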

synnimeng commented 2 months ago

Thanks for your reply, it helps a lot!

synnimeng commented 2 months ago

However, I have another question. My scTE result has columns named "5S_rRNA", "5_8S_rRNA", and "7SK". I found these IDs in the GTF. 5S_rRNA:

chr1    ENSEMBL exon    182944365       182944490       .       -       .       gene_id "ENSG00000285609.1"; transcript_id "ENST00000648701.1"; gene_type "rRNA_pseudogene"; gene_name "5S_rRNA"; transcript_type "rRNA_pseudogene"; transcript_name "5S_rRNA.6-201"; exon_number 1; exon_id "ENSE00003839665.1"; level 3; tag "basic"; tag "Ensembl_canonical";

7SK:

chr1    ENSEMBL exon    9947318 9947636 .       +       .       gene_id "ENSG00000202415.1"; transcript_id "ENST00000365545.1"; gene_type "misc_RNA"; gene_name "RN7SKP269"; transcript_type "misc_RNA"; transcript_name "RN7SKP269-201"; exon_number 1; exon_id "ENSE00001440308.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:45993"; tag "basic"; tag "Ensembl_canonical";

Neither 'protein_coding' nor 'lincRNA' appears in these lines. Is that correct? My GTF file is gencode.v45.annotation.gtf.gz

jphe commented 1 month ago

The 'protein_coding' and 'lincRNA' tags are not used by scTE and have no impact on how scTE runs.
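To check which gene_name and gene_type a given GTF record carries, a small parser for the attribute column can help. The sketch below is only an inspection aid with hypothetical function names; it is not scTE's own parser and makes no claim about how scTE reads the GTF. It assumes GENCODE-style attributes, where values are either quoted (gene_name "5S_rRNA";) or bare (level 3;) and keys such as tag can repeat.

```python
import re

# Matches `key "value";` as well as bare `key value;` pairs in GTF column 9.
_ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_attributes(attr_field):
    """Return {key: [values...]} so repeated keys such as `tag` are kept."""
    attrs = {}
    for key, quoted, bare in _ATTR_RE.findall(attr_field):
        attrs.setdefault(key, []).append(quoted if quoted else bare)
    return attrs


def gene_name_and_type(gtf_line):
    """Extract (gene_name, gene_type) from one tab-separated GTF record."""
    fields = gtf_line.rstrip("\n").split("\t")
    attrs = parse_gtf_attributes(fields[8])  # column 9 holds the attributes
    name = attrs.get("gene_name", [None])[0]
    gene_type = attrs.get("gene_type", [None])[0]
    return name, gene_type
```

Running `gene_name_and_type` over the exon lines of the GTF and collecting the results is one way to see every gene_type that is attached to a name such as "5S_rRNA".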