u15412611 commented 2 years ago

Hi, thank you for the scripts to determine WSGUPS. I get this error when I run the command

snakemake --snakefile Snakefile --use-conda -j 8 -k

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job             count  min threads  max threads
absrel_stats    1      1            1
aggregate_fams  1      1            1
final           1      1            1
final_stats     1      1            1
make_families   1      1            1
move_absrel     1      1            1
move_fubar      1      1            1
total           7      1            1

Select jobs to execute...

[Thu Jun 23 00:20:35 2022]
localcheckpoint make_families:
    input: proteinortho/protein_families.poff.tsv
    output: families/faas
    jobid: 6
    reason: Missing output files: families/faas
    resources: tmpdir=/tmp
Downstream jobs will be updated after completion.

Traceback (most recent call last):
  File "/home/percy/wsgups/.snakemake/scripts/tmp83tsi3_w.pillars.py", line 33, in <module>
    fam = pd.read_csv("fam.txt", sep="\t", header=None)
  File "/root/anaconda3/envs/snakemake/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/root/anaconda3/envs/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/root/anaconda3/envs/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/root/anaconda3/envs/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "/root/anaconda3/envs/snakemake/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
[Thu Jun 23 00:20:36 2022]
Error in rule make_families:
    jobid: 6
    output: families/faas

RuleException:
CalledProcessError in line 58 of /home/percy/wsgups/Snakefile:
Command 'set -euo pipefail; /root/anaconda3/envs/snakemake/bin/python3.10 /home/percy/wsgups/.snakemake/scripts/tmp83tsi3_w.pillars.py' returned non-zero exit status 1.
  File "/home/percy/wsgups/Snakefile", line 58, in __rule_make_families
  File "/root/anaconda3/envs/snakemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-06-23T002033.579836.snakemake.log

Could you assist me? I am using Snakemake v7.8.1.

danielzmbp commented 2 years ago

Hi, thanks for the message. Can you check that the file "proteinortho/protein_families.poff.tsv" was created after the previous rule ran and is not empty? What do your input files look like? If you send them to me I can also take a look.
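The check suggested above can be scripted. This is a minimal sketch, not part of the wsgups pipeline: `describe_table` is a hypothetical helper name, and it assumes that lines starting with `#` in the table are header or comment lines.

```python
from pathlib import Path


def describe_table(path):
    """Return (exists, size_in_bytes, data_rows) for a tab-separated file.

    Assumption: lines starting with '#' are header/comment lines and are
    not counted as data rows.
    """
    p = Path(path)
    if not p.is_file():
        return (False, 0, 0)
    with p.open() as handle:
        rows = sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
    return (True, p.stat().st_size, rows)
```

Calling `describe_table("proteinortho/protein_families.poff.tsv")` from the pipeline directory would show whether the file exists, how large it is, and how many data rows it holds.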

u15412611 commented 2 years ago

Hi, my input data looks like this:

.faa

1_1
MLSEEKKESEVEIKPTEDSVSEKPSVADVKKVADVKKVADVKKVADVKK
1_2
MMYRSRLGTDLSNITLDYVSSINDDSEIALYDIIGSQAHTIMLLQNNIITKNDAKKILSS
LENLKNEKFDSSSGAEDIHELIESLVIKKAGMASGGKMHTARSRNDQVVLDIRMKIRDDI
NIICNCLLDTIESLVSVSKNHQKTIMPFYTHLQQAQAGLFSHYLLAQADVLSRDFQRLFD
TFQRINQSPLGAGPVGGTSIAIDRHSTAKMLGFDGVVENSIDATSARDFVAEYVAMISIL
MTNLSRISEDFIIWSTSEFSFIELSDEFTSPSSVMPQKKNPDILELTRGKTAEIIGNLTA
ILTTIKGLASGYGRDLQQIKSSIWSTSKISISALLIIKSIVLTMKVNEKQMKKVTESSNL
IALDIAEKLVQEGIPFRVTHKIAGVLVQLAHNSKKPISKLTTLEIKKSVEGTKIDPKIVS

.fna

NODE_23_length_59792_cov_23.204747_1 # 1 # 147 # -1 # ID=1_1;partial=10;start_type=TTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.299
TTGTTGTCTGAGGAGAAAAAAGAATCTGAAGTAGAAATAAAACCTACAGAAGATTCTGTATCTGAAAAACCATCAGTTGCAGATGTTAAAAAAGTTGCAGATGTTAAAAAAGTTGCAGATGTTAAAAAAGTTGCAGATGTTAAAAAA
NODE_23_length_59792_cov_23.204747_2 # 523 # 1983 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.300
ATGATGTATCGCTCGCGACTTGGTACTGATTTGAGTAATATCACTCTGGATTATGTTTCATCAATAAATGATGATTCTGAAATTGCTTTGTATGATATTATTGGAAGTCAAGCCCATACCATAATGTTACTTCAAAATAATATTATTACAAAAAATGATGCAAAAAAAATTTTATCCTCCTTGGAAAATCTGAAAAATGAAAAATTTGATTCTTCATCTGGAGCAGAAGATATTCATGAATTAATTGAATCTCTAGTAATTAAAAAAGCAGGTATGGCAAGTGGTGGAAAAATGCATACTGCAAGATCCAGAAATGATCAAGTTGTTTTAGATATTAGGATGAAAATTAG

.gff

gff-version 3

Sequence Data: seqnum=1;seqlen=59792;seqhdr="NODE_23_length_59792_cov_23.204747"

Model Data: version=Prodigal.v2.6.3;run_type=Metagenomic;model="39|Rickettsia_conorii_Malish_7|B|32.4|11|1";gc_cont=32.40;transl_table=11;uses_sd=1

NODE_23_length_59792_cov_23.204747 Prodigal_v2.6.3 CDS 1 147 19.5 - 0 ID=1_1;partial=10;start_type=TTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.299;conf=98.71;score=18.89;cscore=30.86;sscore=-11.98;rscore=-0.99;uscore=-0.73;tscore=-9.61;
NODE_23_length_59792_cov_23.204747 Prodigal_v2.6.3 CDS 523 1983 198.6 - 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.300;conf=99.99;score=198.00;cscore=196.39;sscore=1.61;rscore=-0.99;uscore=0.35;tscore=2.90;

.tsv

NODE_242_length_21397_cov_15.155046_23 95f056a25d64e5e8c5c3ed0f8d0d601d 140 TIGRFAM TIGR03618 Rv1155_F420: PPOX class probable F420-dependent enzyme 13 130 3.0E-29 T 03-06-2022 IPR019920 F420-binding domain, putative
NODE_242_length_21397_cov_15.155046_23 95f056a25d64e5e8c5c3ed0f8d0d601d 140 Pfam PF01243 Pyridoxamine 5'-phosphate oxidase 9 87 5.4E-103-06-2022 IPR011576 Pyridoxamine 5'-phosphate oxidase, putative
NODE_242_length_21397_cov_15.155046_23 95f056a25d64e5e8c5c3ed0f8d0d601d 140 SUPERFAMILY SSF50475 FMN-binding split barrel 6 131 7.86E-23 T 03-06-2022 - -

danielzmbp commented 2 years ago

The problem seems to be that the .faa and the .fna have different headers, so the pipeline cannot match each CDS to its protein. If the sequences are in the same order in both files and neither has extra records, one way to rename the headers is to first collect the headers of the .fna in a list:

awk 'sub(/^>/, "")' your_fasta.fna > headers.txt

and then use them to replace the headers in the .faa:

awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' headers.txt your_fasta.faa > your_new_fasta.faa

I hope this is helpful. Please let me know if there are further problems.
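Before and after renaming, it can help to confirm whether the two files actually share record IDs. The following is a rough sketch under assumptions: `fasta_headers` and `compare_headers` are hypothetical helpers, and they compare only the first whitespace-delimited word of each header, which may not be exactly how the pipeline matches records.

```python
def fasta_headers(path):
    """Return the record IDs of a FASTA file, in file order.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    """
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                words = line[1:].split()
                ids.append(words[0] if words else "")
    return ids


def compare_headers(faa_path, fna_path):
    """Summarize how the IDs of a protein and a nucleotide FASTA differ."""
    faa = fasta_headers(faa_path)
    fna = fasta_headers(fna_path)
    return {
        "same_count": len(faa) == len(fna),
        "same_order": faa == fna,
        "only_in_faa": sorted(set(faa) - set(fna)),
        "only_in_fna": sorted(set(fna) - set(faa)),
    }
```

If `same_order` is false but `same_count` is true, the files likely hold the same records under different names, which is the situation the awk commands above are meant to fix.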

u15412611 commented 2 years ago

Thank you for your input. I made the changes as you directed and reran the command. However, I still get the same error:

[Fri Jun 24 11:47:24 2022]
Error in rule make_families:
    jobid: 6
    output: families/faas

RuleException:
CalledProcessError in line 58 of /home/percy/wsgups/Snakefile:
Command 'set -euo pipefail; /root/anaconda3/envs/snakemake/bin/python3.10 /home/percy/wsgups/.snakemake/scripts/tmpyaticqp6.pillars.py' returned non-zero exit status 1.
  File "/home/percy/wsgups/Snakefile", line 58, in __rule_make_families
  File "/root/anaconda3/envs/snakemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-06-24T114714.015808.snakemake.log

The proteinortho/protein_families.poff.tsv file was produced as below:

Species  Genes  Alg.-Conn.  Crenar_1501.faa  Crenar_1502.faa
2        2      1           105_4            81_2
2        2      1           106_4            46_3
2        2      1           106_5            46_4
2        2      1           10_13            50_2
2        2      1           10_18            50_6
2        2      1           10_19            50_7
2        2      1           10_20            95_2
2        2      1           10_21            95_3
2        2      1           112_4            13_6

I am not sure what the problem could be.

danielzmbp commented 2 years ago

It seems you're analyzing two taxa. By default, only families with at least 5 members are kept, so with two taxa (two genes per family) no family passes the cutoff, which would leave fam.txt empty. You can change line 56 in the Snakefile to, for example, cutoff=0 and it should run the analysis. However, the HyPhy programs usually recommend having a larger number of taxa to get more significant results, so I would suggest including other closely related taxa. Hope this helps.
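To see how many families a given cutoff would keep before rerunning the pipeline, the proteinortho table can be filtered directly. This is only a sketch under assumptions: `families_passing_cutoff` is a hypothetical helper, it treats the first line as the header, and it compares the cutoff against the second column (Genes); the actual filtering in the pipeline's script may differ.

```python
import csv


def families_passing_cutoff(poff_path, cutoff):
    """Return the rows of a proteinortho table with at least `cutoff` genes.

    Assumptions: the file is tab-separated, the first line is a header,
    and the gene count of each family is in the second column.
    """
    kept = []
    with open(poff_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line, if any
        for row in reader:
            if len(row) < 2 or row[0].startswith("#"):
                continue  # skip blank, malformed, or comment lines
            if int(row[1]) >= cutoff:
                kept.append(row)
    return kept
```

For a table like the one posted above, where every family has 2 genes, a cutoff of 5 would return no rows while a cutoff of 0 would return all of them.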

danielzmbp / wsgups

Error in rule make_families: #7
