"Taxa_counts" script - Githubissues

ZhangMH2000 commented 1 year ago

Dear author, Thank you for your wonderful work! I have used your taxa counts script to extract true microbiome reads. As expected, I have got 2 files: *.all.barcodes.txt and *.counts.txt. However, when I compare the two files, I found that the barcodes corresponding to taxid in *.counts.txt have barcodes which don't exist in the same taxid corresponed barcodes in *.all.barcodes.txt. It means that the *.counts.txt have more number of barcodes and corresponding counts to the barcodes and corresponding UMI of the *.all.barcodes.txt with the same taxid. I don't konw whether there is a problem ,or is there anything I don't understand clearly about scRNA. Thank you sincerely for your reply!

Yunuuuu commented 1 year ago

Does the official script work as expected?

ZhangMH2000 commented 1 year ago

About this problem, I found that there was a problem with the code s.mat = sparseMatrix(as.integer(s.bc$barcode), as.integer(s.bc$taxid), x=s.bc$umi) when running the original script, and the sparseMatrix cannot be created. I just rewrite the script with pivot_wider function and got the data frame. Speaking of this, I found that the data frame from the original script had the exactly same number of rows and columns as the dataframe output from your script, but some contents are missing in the dataframe output by your script. For example, the original script is all "eukaryotic" in the column of kingdom, while the CSV output by your script is "fungal" or empty string.

Yunuuuu commented 1 year ago

Could you please provide some examples (file) for me to reproduce the problems? and It would be better if you can provide the code you used.

ZhangMH2000 commented 1 year ago

The SRA file I used is SRR12492057. I downloaded the original bam file from SRA and use 10X bamtofastq files to convert it as I1.fastq, R1.fastq and R2.fastq. The official code in taxa_counts line75-77 can not be exacuated:

s.mat = sparseMatrix(as.integer(s.bc$barcode), as.integer(s.bc$taxid), x=s.bc$umi)
colnames(s.mat) = levels(s.bc$taxid)
rownames(s.mat) = levels(s.bc$barcode)

so I rewritethe code:

s.mat = s.bc %>%
  pivot_wider(names_from = taxid, values_from = umi) %>%
  as.data.frame() %>%
  `rownames<-`(.[[1]]) %>%
  select(-barcode) %>%
  replace(is.na(.), 0) %>%
  as.matrix() %>%
  as('sparseMatrix')

Your code can be run successfully without problem. Thank you sincerely for your effort!

ZhangMH2000 commented 1 year ago

I think the results from your script is almost the same as the output from the official script. The initial issue I posed still exists by running offiial script, so I think this issue need to be ask for the SAHMI team.

ZhangMH2000 commented 1 year ago

Hi, I have just figured out the initial issue I posed. For example, I had 12 barcodes and corresponding UMI with ‘taxid 9’ in the file .all.barcodes.txt, but I had 94 barcodes and corresponding counts with ‘taxid 9’ in the file .counts.txt. To understand why the .counts.txt file had more barcodes than the .all.barcodes.txt file for the same taxid, I investigated where the ‘taxid 9’ came from. The official script (line 89-102) extracted taxa parent information as well as the taxid, while the taxid was requalified with the name of ‘species’. However, there are also classifications under species, and NCBI taxonomy also gave them more detailed taxids. For example, taxid 9 refers to Buchnera aphidicola, while taxid 1241832 refers to Buchnera aphidicola (Acyrthosiphon lactucae). Both taxids indicate species ‘Buchnera aphidicola’, but the taxid 1241832 was considered as taxid 9 by SAHMI. Therefore, the two output files had some differences. This was the reason for my initial issue.

As for the later question, I just compared the official *.counts.txt output file with your script output again. By column-to-column comparasion, the only differences between you two is the 'kingdom' column, which was just the dismatches I mentioned above. The others were identical with the official output. I hope this may help you.

Yunuuuu commented 1 year ago

Really thanks for your details, I'll check it when having time

Yunuuuu commented 7 months ago

With the latest rsahmi package, we don't use mpa file anymore, and all downstream operations have been implemented with pure R langauge (all depends on polars package), they'll be faster. We don't transform string between " " and "_", all taxa name will be kept as the same of kraken, which makes it accurater when do kmer statistics and taxa counting.

*.all.barcodes.txt and *.counts.txt won't be use anymore, we'll return a DataFrame, taxa counting will use information from kmer function directly, they'll be consistent if you don't filter taxids in taxa_counts. I'll close this issue.

Hi, I have just figured out the initial issue I posed. For example, I had 12 barcodes and corresponding UMI with ‘taxid 9’ in the file .all.barcodes.txt, but I had 94 barcodes and corresponding counts with ‘taxid 9’ in the file .counts.txt. To understand why the .counts.txt file had more barcodes than the .all.barcodes.txt file for the same taxid, I investigated where the ‘taxid 9’ came from. The official script (line 89-102) extracted taxa parent information as well as the taxid, while the taxid was requalified with the name of ‘species’. However, there are also classifications under species, and NCBI taxonomy also gave them more detailed taxids. For example, taxid 9 refers to Buchnera aphidicola, while taxid 1241832 refers to Buchnera aphidicola (Acyrthosiphon lactucae). Both taxids indicate species ‘Buchnera aphidicola’, but the taxid 1241832 was considered as taxid 9 by SAHMI. Therefore, the two output files had some differences. This was the reason for my initial issue.

Mis-matched taxa will also be fixed by the latest package

Yunuuuu / rsahmi

"Taxa_counts" script #2