Preparing a DB for StrainGR

Hi,

Been testing StrainGE and been quite impressed so far. However, in testing on a complex sample I wanted to include more than one species in the reference DB, and I was confused about the suggested way to do so.

In particular, I'm confused why running StrainGR requires "StrainGST results on one or more samples". Shouldn't creating a refdb rely only on reference data i.e. not metagenomes?

I'm interested in starting with a set of reference genomes, and getting StrainGR to call variants from a metagenome de novo. Is there a recommend way of doing that?

I attempted the following, not specifying -s or -S which I believe is allowable reading the -h but perhaps am wrong:

$ ls references/
GCA_002369315.1_genomic.fna  GCA_002412705.1_genomic.fna
$ straingr prepare-ref -p "references/{ref}.fna" -o refs_concat.fna
2023-04-04 13:42:33,826 - INFO:root:Determining which reference strains to include...
2023-04-04 13:42:33,826 - INFO:root:Found 0 reference strains to include.
2023-04-04 13:42:33,826 - INFO:root:Checking file paths...
2023-04-04 13:42:33,826 - INFO:root:Path template: references/{ref}.fna
2023-04-04 13:42:33,826 - INFO:root:Creating concatenated reference...
2023-04-04 13:42:33,828 - INFO:root:Wrote FASTA file to refs_concat.fna
2023-04-04 13:42:33,828 - INFO:root:Analyzing repetitiveness of concatenated reference...
/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/site-packages/skbio/io/registry.py:547: FormatIdentificationWarning: <_io.TextIOWrapper name='refs_concat.fna' mode='r' encoding='UTF-8'> does not look like a fasta file
  warn("%r does not look like a %s file"
2023-04-04 13:42:33,829 - WARNING:strainge.variant_caller:Could not find a metadata file for reference %s, and therefore StrainGR has no sense of the repetitiveness of the concatenated reference. Abundance metrics may be skewed.
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/bin/straingr", line 11, in <module>
    sys.exit(straingr_cli())
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/site-packages/strainge/cli/main.py", line 110, in __call__
    self.run(args)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/site-packages/strainge/cli/registry.py", line 83, in run
    rc = subcommand_func(**args_dict)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/site-packages/strainge/cli/straingr.py", line 237, in __call__
    repeat_masks = analyze_repetitiveness(str(output), minmatch)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/site-packages/strainge/variant_caller.py", line 385, in analyze_repetitiveness
    p = subprocess.run(cmd, capture_output=True, text=True)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/subprocess.py", line 2047, in _communicate
    stderr = self._translate_newlines(stderr,
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/2970e699d06b7b43623728aaca037821_/lib/python3.10/subprocess.py", line 1031, in _translate_newlines
    data = data.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 115: invalid start byte

I'm using the conda version FWIW.

Thanks, ben

broadinstitute / StrainGE

Preparing a DB for StrainGR #29