caozhichongchong / arg_ranker

MIT License

HiFi reads #10

Closed ye00ye closed 1 year ago

ye00ye commented 1 year ago

Hello, I want to ask whether the pipeline is suited to HiFi reads. Or should I treat every HiFi read as a single genome and run the pipeline in genome mode?

caozhichongchong commented 1 year ago

Hi,

Thank you for reaching out!

Good question! We haven't tested this pipeline on HiFi reads, but you can use it with some adjustments. Yes, it would work if you treat each HiFi read or each assembly bin as a single genome.

Alternatively, you can treat a whole sample as one genome and count the ARGs of each risk rank using the blast results in arg_ranking/search_output/*.blast.txt.filter and this ARG rank table. Basically, you can pd.merge the blast results with the ARG rank table to link the ARGs detected on reads to their risk ranks. Depending on what you are testing, you can summarize ARG risks for each assembly bin or for each sample. I'm happy to chat more or go through your results together if you want to :)
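The merge step described above can be sketched roughly as follows. This is only an illustration with made-up data: it assumes the `.blast.txt.filter` file is in BLAST tabular format (outfmt 6), and the rank table column names `ARG` and `Rank`, the ARG identifiers, and the helper `count_ranks` are all hypothetical, so check them against your own files before reusing it.

```python
import io

import pandas as pd

# ASSUMPTION: the filtered blast output uses the 12 standard outfmt 6 columns.
BLAST_COLS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up stand-in for one arg_ranking/search_output/*.blast.txt.filter file.
blast_txt = (
    "read1\tARG_A\t98.5\t120\t1\t0\t1\t360\t1\t120\t1e-60\t230\n"
    "read2\tARG_A\t97.0\t118\t3\t0\t4\t357\t2\t119\t1e-55\t220\n"
    "read3\tARG_B\t95.2\t200\t9\t0\t1\t600\t1\t200\t1e-90\t350\n"
    "read4\tARG_C\t99.0\t150\t1\t0\t1\t450\t1\t150\t1e-80\t300\n"
)

# Made-up rank table; the column names "ARG" and "Rank" are assumptions.
rank_txt = "ARG\tRank\nARG_A\tI\nARG_B\tIV\n"


def count_ranks(blast_handle, rank_handle):
    """Count the distinct reads that hit ARGs of each risk rank."""
    blast = pd.read_csv(blast_handle, sep="\t", header=None, names=BLAST_COLS)
    ranks = pd.read_csv(rank_handle, sep="\t")
    # Left merge keeps hits whose ARG is missing from the rank table.
    merged = blast.merge(ranks, left_on="sseqid", right_on="ARG", how="left")
    merged["Rank"] = merged["Rank"].fillna("unranked")
    counts = merged.groupby("Rank")["qseqid"].nunique()
    return {rank: int(n) for rank, n in counts.items()}


rank_counts = count_ranks(io.StringIO(blast_txt), io.StringIO(rank_txt))
print(rank_counts)
```

To summarize per assembly bin or per sample instead, one option would be to add a bin or sample column to the blast table before merging and include it in the `groupby`.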

Hope it helps!

Best regards, Anni

ye00ye commented 1 year ago

Thanks for your help.

Now I'm trying to use SARG and this rank framework for my resistome study with HiFi metagenomic sequencing. If I get some interesting results, I hope to discuss them with you thoroughly.

Another question. Recently I read a paper by Professor Zhang and found that the authors mapped nanopore reads directly to SARG with the LAST tool, without gene prediction (such as prodigal). So I want to know whether read mapping is less accurate than gene mapping. It seems that in ARGs-OAP the NGS reads are also mapped directly to the SARG database without gene prediction.
(Is aligning reads directly to the database less accurate than aligning predicted genes to the database?)

Hoping for your help. ye--ye

caozhichongchong commented 1 year ago

Hi ye--ye,

Sure! I'm happy to chat more about your results :) You can also reach out to me via caozhichongchong@gmail.com or anniz44@mit.edu if you want to.

Good question!

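I think we are talking about a few things:

1) whether to predict genes or screen out non-ORF (open reading frame) regions from our samples before mapping
2) mapping methods:
   - read mapping, like LAST or bowtie
   - nucleotide similarity/homology search, like blastn
   - amino-acid similarity/homology search, like blastp or blastx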

1) It's hard to predict genes from short NGS reads because most reads are shorter than a gene. There might be some tools that can handle that, but I'm not confident how well they work. It would be hard to predict genes even for some short nanopore reads, if a gene extends past the end of a read. One way around it is to assemble genomes and then predict genes, but that loses the abundance information. You can align your reads back to the assembly to recover abundance. If you are concerned that some non-ORF regions could be very similar to part of an ARG, it might happen for NGS reads but is not likely for long reads, because you can require the alignment to cover a good fraction of the ARG's length, e.g. 80% (as required in both ARGs-OAP and arg_ranker).
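As a rough illustration of the length-coverage requirement mentioned above, the check could look like the sketch below. The function names and the choice to compute coverage from the subject (ARG) alignment coordinates are my assumptions; this is not the actual code from ARGs-OAP or arg_ranker.

```python
def subject_coverage(sstart, send, subject_len):
    """Fraction of the reference ARG covered by one alignment.

    Coordinates are 1-based and inclusive, as in BLAST tabular output.
    abs() handles hits reported on the reverse strand (send < sstart).
    """
    return (abs(send - sstart) + 1) / subject_len


def passes_coverage(sstart, send, subject_len, min_cov=0.8):
    """True if the alignment covers at least min_cov of the ARG."""
    return subject_coverage(sstart, send, subject_len) >= min_cov


# Hypothetical ARG protein of 300 amino acids.
full_hit = passes_coverage(5, 290, 300)    # covers 286 of 300 residues
partial_hit = passes_coverage(1, 90, 300)  # covers only 90 of 300 residues
print(full_hit, partial_hit)
```

A short non-ORF fragment that happens to resemble part of an ARG would look like `partial_hit` here and be dropped by the threshold.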

2) Different mapping methods: I think read mapping and nucleotide similarity search are essentially the same. They both depend on the same kind of search algorithm (k-mer based) and quite similar scoring algorithms. Read mapping tries to find which part of the reference genome a read comes from, assuming that the reference genome is an ancestor or relative of the target bacterial strain in the sample. Similarity search tries to find two sequences that share more similarity than would be expected by chance (statistically).

I think amino-acid search is a better way to annotate ARGs for general purposes. Genes function through proteins. For example, two ARGs with the same amino acid sequence but different nucleotide sequences (synonymous differences) could confer the same function. Moreover, amino-acid search is more sensitive for finding homologous genes, because it's more strongly affected by mutations that shift the reading frame (see the extra readings if you are interested).

However, nucleotide search is a better way to do ARG risk ranking, especially for mobility. We infer the probability that an ARG in your sample is mobile based on whether it's highly similar to a mobile gene in my database. Imagine I found an ARG on a plasmid and it's quite similar to an ARG in an E. coli genome. We would want to estimate the probability that these two ARGs are shared because of horizontal gene transfer, by computing the evolutionary time between the two genes. For evolutionary time, every mutation counts, both nonsynonymous and synonymous ones, and nucleotide search can give us that information.
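To make the synonymous-mutation point concrete, here is a small toy example. The two sequences are invented, and the helper names are hypothetical; the sketch only shows that differences visible at the nucleotide level can disappear after translation with the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Two invented gene fragments that differ only at synonymous sites.
gene_a = "ATGGCTAAA"
gene_b = "ATGGCCAAG"

nt_diff = count_mismatches(gene_a, gene_b)
aa_diff = count_mismatches(translate(gene_a), translate(gene_b))
print(nt_diff, aa_diff)
```

An amino-acid search would see these two fragments as identical, while a nucleotide comparison still counts the two substitutions that matter for estimating divergence.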

Hope it helps! Anni

ye00ye commented 1 year ago

Thanks for your patient answers.

You said nucleotide search is a better way to do ARG risk ranking. I wonder whether arg_ranker actually uses nucleotide search to evaluate ARG ranks, because I only found two protein fasta files (all_KO30.pro.fasta and SARG.db.fasta) in arg_ranker/data that you may use for ARG mapping. If arg_ranker used nucleotide search to evaluate ARG ranks, it should contain a nucleotide ARG database.
Hoping for your answer. ye--ye

caozhichongchong commented 1 year ago

Hi ye--ye,

You are absolutely right! Unfortunately, the version of the SARG database I used only contains amino-acid sequences, the same as many existing ARG databases. It would be great if future studies collected the DNA sequences of ARGs and did risk assessment based on them. Sorry if what I said was confusing. I meant to say that the best way to do ARG risk ranking would be through nucleotide search!

Hope it helps, Anni

ye00ye commented 1 year ago

thanks for your patient answer

ye--ye

caozhichongchong commented 1 year ago

You are welcome :) Happy to chat about your results if you want to!

Good luck with your research! Anni