Closed ZhangMH2000 closed 7 months ago
Does the official script work as expected?
About this problem, I found that there was a problem with the code s.mat = sparseMatrix(as.integer(s.bc$barcode), as.integer(s.bc$taxid), x=s.bc$umi)
when running the original script, and the sparseMatrix cannot be created. I just rewrite the script with pivot_wider
function and got the data frame. Speaking of this, I found that the data frame from the original script had the exactly same number of rows and columns as the dataframe output from your script, but some contents are missing in the dataframe output by your script. For example, the original script is all "eukaryotic" in the column of kingdom, while the CSV output by your script is "fungal" or empty string.
Could you please provide some examples (file) for me to reproduce the problems? and It would be better if you can provide the code you used.
The SRA file I used is SRR12492057. I downloaded the original bam file from SRA and use 10X bamtofastq files to convert it as I1.fastq, R1.fastq and R2.fastq.
The official code in taxa_counts
line75-77 can not be exacuated:
s.mat = sparseMatrix(as.integer(s.bc$barcode), as.integer(s.bc$taxid), x=s.bc$umi)
colnames(s.mat) = levels(s.bc$taxid)
rownames(s.mat) = levels(s.bc$barcode)
so I rewritethe code:
s.mat = s.bc %>%
pivot_wider(names_from = taxid, values_from = umi) %>%
as.data.frame() %>%
`rownames<-`(.[[1]]) %>%
select(-barcode) %>%
replace(is.na(.), 0) %>%
as.matrix() %>%
as('sparseMatrix')
Your code can be run successfully without problem. Thank you sincerely for your effort!
I think the results from your script is almost the same as the output from the official script. The initial issue I posed still exists by running offiial script, so I think this issue need to be ask for the SAHMI team.
Hi, I have just figured out the initial issue I posed. For example, I had 12 barcodes and corresponding UMI with ‘taxid 9’ in the file .all.barcodes.txt, but I had 94 barcodes and corresponding counts with ‘taxid 9’ in the file .counts.txt. To understand why the .counts.txt file had more barcodes than the .all.barcodes.txt file for the same taxid, I investigated where the ‘taxid 9’ came from. The official script (line 89-102) extracted taxa parent information as well as the taxid, while the taxid was requalified with the name of ‘species’. However, there are also classifications under species, and NCBI taxonomy also gave them more detailed taxids. For example, taxid 9 refers to Buchnera aphidicola, while taxid 1241832 refers to Buchnera aphidicola (Acyrthosiphon lactucae). Both taxids indicate species ‘Buchnera aphidicola’, but the taxid 1241832 was considered as taxid 9 by SAHMI. Therefore, the two output files had some differences. This was the reason for my initial issue.
As for the later question, I just compared the official *.counts.txt output file with your script output again. By column-to-column comparasion, the only differences between you two is the 'kingdom' column, which was just the dismatches I mentioned above. The others were identical with the official output. I hope this may help you.
Really thanks for your details, I'll check it when having time
With the latest rsahmi
package, we don't use mpa file anymore, and all downstream operations have been implemented with pure R langauge (all depends on polars
package), they'll be faster. We don't transform string between " " and "_", all taxa name will be kept as the same of kraken
, which makes it accurater when do kmer statistics and taxa counting.
*.all.barcodes.txt
and *.counts.txt
won't be use anymore, we'll return a DataFrame
, taxa counting will use information from kmer
function directly, they'll be consistent if you don't filter taxids in taxa_counts
. I'll close this issue.
Hi, I have just figured out the initial issue I posed. For example, I had 12 barcodes and corresponding UMI with ‘taxid 9’ in the file .all.barcodes.txt, but I had 94 barcodes and corresponding counts with ‘taxid 9’ in the file .counts.txt. To understand why the .counts.txt file had more barcodes than the .all.barcodes.txt file for the same taxid, I investigated where the ‘taxid 9’ came from. The official script (line 89-102) extracted taxa parent information as well as the taxid, while the taxid was requalified with the name of ‘species’. However, there are also classifications under species, and NCBI taxonomy also gave them more detailed taxids. For example, taxid 9 refers to Buchnera aphidicola, while taxid 1241832 refers to Buchnera aphidicola (Acyrthosiphon lactucae). Both taxids indicate species ‘Buchnera aphidicola’, but the taxid 1241832 was considered as taxid 9 by SAHMI. Therefore, the two output files had some differences. This was the reason for my initial issue.
Mis-matched taxa will also be fixed by the latest package
Dear author, Thank you for your wonderful work! I have used your
taxa counts
script to extract true microbiome reads. As expected, I have got 2 files:*.all.barcodes.txt
and*.counts.txt
. However, when I compare the two files, I found that the barcodes corresponding to taxid in*.counts.txt
have barcodes which don't exist in the same taxid corresponed barcodes in*.all.barcodes.txt
. It means that the*.counts.txt
have more number of barcodes and corresponding counts to the barcodes and corresponding UMI of the*.all.barcodes.txt
with the same taxid. I don't konw whether there is a problem ,or is there anything I don't understand clearly about scRNA. Thank you sincerely for your reply!