bsmith89 / StrainFacts

Factorize metagenotypes to infer strains and their abundances
MIT License
11 stars 1 forks source link

Improve documentation of input formats #5

Open SilasK opened 1 year ago

bsmith89 commented 1 year ago

Check out the rendered demo notebook here: https://byronjsmith.com/StrainFacts/simulate_data.html (Apologies for the buried documentation.)

The last step in that example is to dump the simulated metagenotype to the TSV file format, which looks like this:

sample  position  allele  metagenotype
0       0         alt     0.0
0       0         ref     10.0
0       1         alt     10.0
0       1         ref     0.0
0       2         alt     10.0
0       2         ref     0.0
0       3         alt     10.0
0       3         ref     0.0
0       4         alt     10.0

(Note: the header is needed; all whitespace field-separators are tab characters)

TSVs in this format can then be loaded into the NetCDF format using sfacts load.

However, if you're using GT-Pro for metagenotyping, there's a --gt-pro flag that loads the metagenotype "parsed" output from that tool. (see https://github.com/zjshi/gt-pro#gt-pro-parsed-output)

jfy133 commented 1 year ago

It would be really nice to improve the documentation on this - while the simulated metagenotype example above is sort of useful, what the third column is exactly meant to represent to me is not at all clear. It would also nice to be able to see 'real' life examples rather than just simulated data.

I've also tried looking through the documentation for GT-Pro and MIDAS as they are suggested as possible metagenotypers - MIDAS does not seem to report such a format anywhere, and GT-Pro does not run on the data I would like to test StrainFacts on (hits an error).

So trying other (meta)genotypers is also somewhat not an option at the moment, as I don't know how to 'recreate' the metagenotyper column here from other tools.

(sorry for the small rant, but most of the tools I've tried installing today have failed/not documented, so feeling slightly frustrated 😅)

jfy133 commented 1 year ago

Just to continue with an example, in sfacts load --help

The flag --gtpro-metagenotype describes the followig:

  --gtpro-metagenotype GTPRO_METAGENOTYPE
                        Path of a metagenotype in 'merged' GT-Pro format; this is the same as output from `GT_Pro parse`, but (1) filtered to just one species, (2) with a sample_id column prepended, and (3) concatenated from N samples.
                        (default: None)

but 'merged' GT-pro output is not described on the GT-Pro documentation, so I assume this is actually referring to modified output from GT-Pro, and with a lot of modifications. An example of this would be helpful as it's not clear to me e.g. if I need to have a specific order of the columns? Do I need to drop any columns (e.g. the contigs column of the GT_Pro parsed output)?

bsmith89 commented 1 year ago

Hi @jfy133, thanks for the feedback on the documentation as it currently stands. Let me try to answer your question here, and this can be the basis for ongoing improvements to the documentation. I also think you might be indirectly suggesting that there should be import functions that operate directly on the raw outputs of the various metagenotyping tools (rather than requiring users to post-process those outputs). This is a good suggestion and I'll definitely look into it.

what the third column is exactly meant to represent to me is not at all clear

(I think this would be easier to explain with a visual, but I'll try laying it all out in explicit detail.)

The generic metagenotype TSV has the following columns:

The columns are as follows: 1: sample :: a unique identifier indexing each of the independent samples being metagenotyped 2: position :: a unique identifier indexing each of the genomic positions being metagenotyped 3: allele :: a unique identifier indexing the alleles that can observed at each position 4: metagenotype :: the tally of reads observed with each allele at each position in each sample

The columns names in the first row of this TSV are important but the fourth column can have any label; it won't matter. I believe the columns can be in any order, but I haven't tested this.

This is a "flattened" table format with three index columns followed by the observed count, representing a 3-dimensional array that has been compressed into a list of counts where each is indexed by a (sample, position, allele) triplet. Since metagenotype data can be very sparse, it's convenient that 0-values can be dropped entirely in this format.

Pertaining to your actual question, the third column is the allele index. In the case of SNP data, this could (theoretically) be a few different things. Perhaps the most obvious would be 'A', 'C', 'T', and 'G'. But, because bi-allelic SNPs are the most common and because using them greatly simplifies the model and computation, the current StrainFacts implementation operates only on bi-allelic sites. Therefore, at each site, the two bases found at that position are (potentially arbitrarily) assigned as "reference" and "alternative". This allows sites where the reference genomes have either an 'A' or a 'C' can be labeled the exact same way as sites with a 'T' or an 'A'.

On the other hand, metagenotype data can have more than two alleles at a site, and I've been considering future development extending StrainFacts to full four-allele SNPs. However, basically all of the current functionality of StrainFacts assumes this is how alleles are encoded. As a consequence, the entries in this column should only be alt or ref.

MIDAS does not seem to report such a format anywhere

Hmm...if you're pointing out that MIDAS doesn't natively produce this StrainFacts input format, yes, you are correct. This needs to be built by the user from the MIDAS outputs that do exist. This might be a bit involved, but I think the two key files would be each of the single-sample outputs and the aggregated SNP info across samples. The latter includes details of which sites appear to be bi-allelic and which bases to use as ref and alt.

GT-Pro is my preferred metagenotyper in part because it is designed specifically for bi-allelic sites, and I don't need to do the somewhat complex parsing to get everything aligned.

GT-Pro does not run on the data I would like to test StrainFacts on (hits an error).

I'm curious what error you are hitting with GT-Pro. It's been pretty robust in my experience. If you think you might be having compilation/installation issues, you might check out this Docker container which has been working for me. The Dockerfile can also be found here.

So trying other (meta)genotypers is also somewhat not an option at the moment, as I don't know how to 'recreate' the metagenotyper column here from other tools.

'merged' GT-pro output is not described on the GT-Pro documentation, so I assume this is actually referring to modified output from GT-Pro, and with a lot of modifications. An example of this would be helpful as it's not clear to me e.g. if I need to have a specific order of the columns? Do I need to drop any columns (e.g. the contigs column of the GT_Pro parsed output)?

The --gtpro-metagenotype flag is a convenience allowing you to do only minimal filtering/appending/concatenating to the input data. Here's some (untested) bash code describing what you could do with already "parsed GT-Pro output" for each sample to generate the input:

species=102506  # Extracting data for E. coli
for tsv in gtpro_output/*.parsed.tsv
do
    # Select only rows for the selected $species and prepend the filename as the sample identifier. 
    awk -v OFS="\t" species="$species" sample="$tsv" '$1==species {print sample, $0}' $tsv
done > metagenotype.gtpro_merged.tsv

And if that flag didn't exist, here's how you could generate the "generic" input table from the GT-Pro output:

species=102506  # Extracting data for E. coli
printf "sample\tposition\tallele\tmetagenotype\n" > metagenotype.tsv  # Write the header.
for tsv in gtpro_output/*.parsed.tsv
do
    # Print a separate line for the reference allele (count in column 7) and the alternative allele (count in column 8).
    # Select only rows for the selected $species and prepend the filename as the sample identifier.
    awk -v OFS="\t" species="$species" sample="$tsv" '$1==species {print sample, $2, "ref", $7 "\n" sample, $2, "alt", $8}' $tsv
done >> metagenotype.tsv

So trying other (meta)genotypers is also somewhat not an option at the moment, as I don't know how to 'recreate' the metagenotyper column here from other tools.

For those using another metagenotyper—or if you don't want to use bash/awk/etc. to do this parsing and reshaping—here's a general recipe for how I think about compiling this data:

  1. Run your preferred metagenotyper on each sample.
  2. Drop any positions that are not both bi-allelic and in the core genome (in the case of GT-Pro both of these things are pre-determined and built into the database, which is nice. For other tools you will probably need to pick core positions based on coverage variation across samples and bi-allelic sites based on some maximum occurence of the other bases.)
  3. Pick (arbitrarily if you'd like) one base to be the reference and one to be the alternative at each position. Again, GT-Pro has already done this for you.
  4. Load each sample as a 2-dimensional array, indexed by position and (edit: 2023-02-21) ~sample~ allele with values representing the observed count.
  5. For each position, relabel the two bases/alleles to ref and alt
  6. Stack your samples into a 3-dimensional array, indexing the new dimension by the sample name. (There are a bunch of different ways to do this, but the xarray or pandas libraries would be the most mainstream options in the Python world.)
  7. "Flatten" your array into one-dimension, taking pains to keep the indexes aligned correctly.
  8. Write a TSV with columns representing the the three indexes and the read counts.

(sorry for the small rant, but most of the tools I've tried installing today have failed/not documented, so feeling slightly frustrated 😅)

Totally understand this. I've definitely been in your shoes. Hopefully the above explanation helps you and I can add it to the documentation for other users as well.

I'd value your feedback on what parts of this explanation are most valuable. Also happy to answer any follow-up questions.

jfy133 commented 1 year ago

Hi @bsmith89 this is very very helpful and very detailed thank you very much!

A few clarifications on my original question (to better describe what was confusing me) as now looking through my original post I made a couple of mistakes (out of rush/frustration - again sorry about that).

  1. I meant actually the fourth column rather than the third column being unclear (ref/alt was intuitive to me, and you do mention already document you only allow bi-allelic positions), but you have already described the fourth column!
  2. I think the reason why I found the metagenotype column unclear/uninuititive (and even possibly now): you mention it is a tally of reads - which implies to me raw counts, and thus should be integers, however in the example data in the jupyter notebook are written floats 10.0, 9.0 etc.
  3. Another mistake I wrote was that it is actually CallM that I was having an issue with not GT-Pro, when I was trying to build a new database (as described here.

Otherwise, some follow up points:

I also think you might be indirectly suggesting that there should be import functions that operate directly on the raw outputs of the various metagenotyping tools (rather than requiring users to post-process those outputs).

To a certain extent yes this would be of course nice, but I was/am also happy to 'manually' create my own metagenotype TSV files from the output of other tools (e.g. maybe VCFs which are probably not standardised enough for you to make a generic inporter); but of course this required the definition of the columns - but you have now provided this I think :).

But certainly additional support in in strainfacts load would be a nice to have in the future :)

Lokoing again in the help message for load, you have --metagenotype --community --genotype flags, which all reference in standard, StrainFacts TSV format - it would be good to document and provide a link(?) to this in the help message - I think this also threw me off as I couldn't find such a 'standard' on the docs/python notebooks. By calling it 'standard' I would normally intepret that as having some prominent place in the documentation.

3: allele :: a unique identifier indexing the alleles that can observed at each position

Note, given your description it might be good to explicitly say something like 'a unique binary indexing of alleles', as StrainFacts currently only supports bi-allelic sites.

Hmm...if you're pointing out that MIDAS doesn't natively produce this StrainFacts input format, yes, you are correct. This needs to be built by the user from the MIDAS outputs that do exist. This might be a bit involved,

For this it was the README section that through me off there as you say 'are currently supported'. You have the --gt-pro convience flag which I shows StrainFacts can 'import directly' (almost it seems)., but would maybe rephrase to say 'output from the following tools can be imported, after some modifications to look like XYZ' (where XYZ is a link to the descriptions of the expected StrainFacts TSV input/or your nice little awk bash scripts).


Of course these are opinionated descriptions by me, but that is my 'user experience' of navigating :)

I hope to try again to get it to run over the next week or so, so I can report back with any other issues/things that are unclear.

Thanks again for the detailed response!

bsmith89 commented 1 year ago

you mention it is a tally of reads - which implies to me raw counts, and thus should be integers, however in the example data in the jupyter notebook are written floats 10.0, 9.0 etc.

Yep, this is indeed a bit of a bug with the export function, albeit one that doesn't affect operation too too much. I should definitely fix this, though, since TSVs are bigger than they should be and don't match dtype expectations. I've added issue #6 addressing it.

at it is actually CallM that I was having an https://github.com/zjshi/CallM/issues/4 with not GT-Pro, when I was trying to build a new database

Gotcha! I haven't tried CallM, but I expect Jason will be helpful when he is able to deal with your issue.

Lokoing again in the help message for load, you have --metagenotype --community --genotype flags, which all reference in standard, StrainFacts TSV format - it would be good to document and provide a link(?) to this in the help message - I think this also threw me off as I couldn't find such a 'standard' on the docs/python notebooks. By calling it 'standard' I would normally intepret that as having some prominent place in the documentation.

Note, given your description it might be good to explicitly say something like 'a unique binary indexing of alleles', as StrainFacts currently only supports bi-allelic sites.

but would maybe rephrase to say 'output from the following tools can be imported, after some modifications to look like XYZ' (where XYZ is a link to the descriptions of the expected StrainFacts TSV input/or your nice little awk bash scripts).

Of course these are opinionated descriptions by me, but that is my 'user experience' of navigating :)

Very helpful feedback. Thanks for taking the time to comment! These will be a priority as I refine the documentation.

P.S.: Looking forward to also hearing about the challenges you face in actually fitting the model. Based on my own personal experience, fiddling with model fitting hyper-parameters can also be frustrating. I've recently settled on a new "default" model and parameters, which I would suggest that you use as a first-attempt:

sfacts fit \
--device cuda \
--random-seed 0 \
--model-structure model4 \
--optimizer-learning-rate 0.05 \
--min-optimizer-learning-rate 1e-4 \
--hyperparameters gamma_hyper=1e-5 pi_hyper=1e-2 pi_hyper2=1e-2 rho_hyper=1e0 rho_hyper2=1e0 \
--anneal-hyperparameters gamma_hyper=0.999 \
--anneal-steps 20000 \
--num-strains ${nstrains} \
${inpath} ${outpath}

# $nstrains should be substantially (~2x?) greater than the actual number of strains.
# If you aren't training on a GPU, drop `--device cuda`.

However, note that model4 is similar but not the same as the model described in the StrainFacts paper (ssdd3_with_error/model1). I am currently working on a project/manuscript that will likely include a description of the updated model.