chg60 / phamclust

Pham-based genome clustering
GNU General Public License v3.0

How can I prepare the input file? #2

Closed: cpplyqz closed this issue 11 months ago

cpplyqz commented 11 months ago

I have some sequences that were annotated with RASTtk, and I want to use this software to cluster my phages, but I don't know how to prepare the input file. My directory looks like this:

$ ll ./testinputdit
P1.fasta P2.fasta

$ less P1.fasta
P1 ATCTCCTAAACAGCACGTTTTGTTAATGACAGGCGTAATTTTAGCATTGTGCAACGCATAAAAAAAGGCCCGATACTAAAAAACCGTACCGAGCCAAAACCAATGGAGATAAT

I ran:

$ phamcluster -g testinputdir ./testoutdir

0: runtime parameters

infile: testinput
outdir: testoutdir/phamclust_21_Nov_2023
debug: False
subcluster: True
remove tmp: False
sub dist: 0.4
sub link: single
clu dist: 0.75
clu link: average
nr dist: 0.25
nr link: complete
metric: peq
cpus: 40

1: parsing genomes

Traceback (most recent call last):
  File "~bin/phamclust", line 8, in <module>
    sys.exit(main())
  File "~/site-packages/phamclust/main.py", line 211, in main
    genomes = load_genomes_from_fasta_dir(infile)
  File "~/site-packages/phamclust/main.py", line 80, in load_genomes_from_fasta_dir
    genomes[name].load(f)
  File "~/site-packages/phamclust/genome.py", line 74, in load
    key, value = field.split("=")
ValueError: not enough values to unpack (expected 2, got 1)

I always get this error. Please help me. Thank you very much!

chg60 commented 11 months ago

Hello and thank you for your interest in using PhamClust!

I see two issues with your input files - the first is that your FASTA files appear to contain nucleotide sequences, whereas PhamClust requires that FASTA inputs be protein sequences (i.e., the genes encoded by your genomes). The second issue is that your FASTA headers are not structured in a way that PhamClust will be able to utilize, as the gene orthology information is not present. See the README.md for this repository for how the FASTA headers should be structured.
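Both problems can be spotted before running PhamClust. The following is a minimal sketch, not part of PhamClust itself: the function names, the 0.9 threshold, and the A/C/G/T/U/N heuristic are all assumptions made for illustration. The only thing taken from the traceback above is that header fields are split on "=", so a header containing no "=" at all (such as "P1") cannot be unpacked into a key and a value. The full header layout is described in the README.md and is not reproduced here.

```python
NUCLEOTIDE_ALPHABET = set("ACGTUN")  # assumption: a simple heuristic alphabet


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def looks_like_nucleotide(sequence, threshold=0.9):
    """Heuristic: True if at least `threshold` of the characters are A/C/G/T/U/N."""
    if not sequence:
        return False
    hits = sum(1 for ch in sequence.upper() if ch in NUCLEOTIDE_ALPHABET)
    return hits / len(sequence) >= threshold


def header_has_key_value(header):
    """True if the header contains at least one '=' (needed for key=value fields)."""
    return "=" in header


def check_fasta(text):
    """Return a list of human-readable problems found in FASTA text."""
    problems = []
    for header, sequence in parse_fasta(text):
        if looks_like_nucleotide(sequence):
            problems.append(f"{header}: sequence looks like nucleotides, not protein")
        if not header_has_key_value(header):
            problems.append(f"{header}: header has no key=value field")
    return problems
```

To check a file, read it first, for example `check_fasta(open("P1.fasta").read())`. An empty list only means these two heuristics found nothing; it does not guarantee that the headers match the format in the README.md.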

I'm assuming that RASTtk provides either a FASTA amino acid or a GenBank flat file output for the annotations. If so, I recommend using PhaMMseqs to define gene phamilies. You should invoke PhaMMseqs like this:

phammseqs /path/to/genome/annotation/fasta/or/gbk/files -o /path/to/outdir -p

Including the -p is important because it will create a file called strain_genes.tsv which PhamClust uses as its preferred input format.
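For scripted runs, the same invocation can be wrapped in Python. This is a rough sketch under stated assumptions: the helper names are hypothetical, only the arguments shown in the command above (the input path, -o, and -p) are used, and because the exact location of strain_genes.tsv inside the output directory is not stated here, the sketch searches the output directory recursively for it.

```python
import subprocess
from pathlib import Path


def build_phammseqs_command(input_path, outdir):
    """Build the argument list for the phammseqs call shown above."""
    return ["phammseqs", str(input_path), "-o", str(outdir), "-p"]


def find_strain_genes(outdir):
    """Return the first strain_genes.tsv found under outdir, or None."""
    # Assumption: the file may sit in outdir or in a subdirectory of it.
    matches = sorted(Path(outdir).rglob("strain_genes.tsv"))
    return matches[0] if matches else None


def run_phammseqs(input_path, outdir):
    """Run phammseqs and return the path to strain_genes.tsv."""
    # check=True raises CalledProcessError if phammseqs exits with an error.
    subprocess.run(build_phammseqs_command(input_path, outdir), check=True)
    tsv = find_strain_genes(outdir)
    if tsv is None:
        raise FileNotFoundError(f"strain_genes.tsv not found under {outdir}")
    return tsv
```

`run_phammseqs` requires phammseqs to be installed and on the PATH. The returned path is the file to hand to PhamClust as its input.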

You might also consider using a different tool than RASTtk for annotating phages - it is intended for bacterial gene annotation and does an OK but not exceptional job of auto-annotating phages.

I'm happy to provide further assistance as needed; please let me know how this goes!

cpplyqz commented 11 months ago

It worked well, Mr. Gauthier! I followed your steps to generate strain_genes.tsv, and the whole process ran very smoothly. Thank you again for your prompt response, and good luck with your research! Here is part of my result:

$ ls phamcluster/phamclust_22_Nov_2023/
40c31f606878e4f11f2ff0e12dfeaba5.tmp cluster_1 cluster_2 peq_heatmap.html phamclust.log singletons

chg60 commented 11 months ago

I'm delighted to hear that you were able to get it working successfully, and I hope you find the output from PhamClust useful to your research!