PROBIC / BIB

Bayesian Identification of Bacteria
MIT License

ERROR: ArgumentParser: argument name phi unknown. #2

Status: Closed (closed by smb20200615 3 years ago)

smb20200615 commented 3 years ago

Hello,

I am trying to run the last step, ./BitSeq/estimateVBExpression -o abundance alignment_info.prob, and I get:

N mapped: 6617084 N total: 8910243 All alignments: 14131584 Isoforms: 18

End: bound decrease iter(s): 4 bound: -185867736.563 grad: 0.0420366 beta: 0.0001973
ERROR: ArgumentParser: argument name phi unknown.

Do you know why this could be?

Many thanks

ahonkela commented 3 years ago

Fixed in BitSeq master branch now.

If you are planning to use BIB, I would highly encourage you to look at mSWEEP (https://github.com/PROBIC/mSWEEP), which is much faster and provides accurate results for a much broader range of bacterial species.

smb20200615 commented 3 years ago

@ahonkela thank you so much for the fix. I am curious by what metric mSWEEP is meant to be better. I ran both mSWEEP and BIB on a mock dataset from a single Staph species, and mSWEEP significantly underperformed compared to BIB. I would really appreciate your thoughts, as I am a bit confused by this result. The main issue was that mSWEEP detected things that weren't in the mock community (false positives).

ahonkela commented 3 years ago

Very interesting. In our experience BIB worked really well for species with a strongly clustered population structure such as Staph aureus, where a single reference strain is sufficient for representing a lineage, but struggled for species with less clear clustering of strains into lineages. mSWEEP solves this by using more reference strains to represent a lineage, which helped significantly with Staph epidermidis and many non-Staph species we have tried.

That said, if you have good references for all lineages you expect to see in the sample, BIB may work better, because mSWEEP makes some simplifications to make things run faster.

If you have not done so yet, I would highly recommend benchmarking the methods in a situation where the strain the reads are coming from is not included in the reference, as that can make a big difference.
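
One way to set up such a benchmark is to drop the source strain from the reference before building the index. The following is only a minimal sketch: the function name hold_out and the example IDs are hypothetical, and it assumes the grouping file has exactly one line per FASTA record, in the same order as the records.

```python
def hold_out(fasta_lines, indicators, held_out_ids):
    """Drop held-out records from a FASTA and the matching grouping lines.

    Assumption: indicators[i] is the group label of the i-th FASTA record.
    """
    n_records = sum(1 for line in fasta_lines if line.startswith(">"))
    if n_records != len(indicators):
        raise ValueError("FASTA record count and indicator count differ")

    kept_fasta, kept_indicators = [], []
    record_index = -1
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            record_index += 1
            # The sequence ID is the first whitespace-delimited token after '>'.
            seq_id = (line[1:].split() or [""])[0]
            keep = seq_id not in held_out_ids
            if keep:
                kept_indicators.append(indicators[record_index])
        if keep:
            kept_fasta.append(line)
    return kept_fasta, kept_indicators


if __name__ == "__main__":
    # Tiny in-memory example with made-up IDs and group labels.
    fasta = [">strainA", "ACGT", ">strainB", "GGCC", ">strainC", "TTAA"]
    groups = ["ST1", "ST1", "ST2"]
    new_fasta, new_groups = hold_out(fasta, groups, {"strainB"})
    print(new_fasta)
    print(new_groups)
```

The reduced FASTA and grouping would then be used to rebuild the pseudoalignment index, while the reads are still simulated from the held-out strain.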

smb20200615 commented 3 years ago

@ahonkela thank you so much for your thorough explanation/insights. Could mSWEEP be underperforming in my S. epidermidis mock community because I provided only one reference genome per sequence type? (The same genomes used to make the mock community were provided to mSWEEP). Thank you for your suggestion about benchmarking BIB and mSWEEP using genomes not used in the mock community.

My other question is about a second mSWEEP run. In my initial trial I used one genome per sequence type. In the second attempt, I ended up using hundreds of reference genomes across different sequence types (I provided their sequence type affiliations via the cluster indicator txt file). I am getting this error:

mSWEEP-v1.4.0-2-g7691c1a abundance estimation
Parsing arguments
Reading the input files
  reading group indicators
  read 1571 group indicators
  reading pseudoalignments
Reading the input files failed:
  grouping has more reference sequences than the pseudoalignment.

Do you know what the issue could be? (Is it because the reference collection is too big?)

Many thanks for your guidance!

ahonkela commented 3 years ago

mSWEEP is built around the idea of using several reference sequences for each lineage to better capture the diversity within the lineage. Using it with just one reference per lineage would be highly suboptimal.

I am not familiar with the error you are seeing with mSWEEP. I would suggest double checking that you are using the correct pseudoalignment index with the exact set of sequences in the grouping, and opening a new issue with mSWEEP if the error persists.

tmaklin commented 3 years ago

Hi @smb20200615,

@ahonkela asked me to comment on the mSWEEP related issues.

If you have high-quality reference genomes available for all organisms used in your mock community, and include only them in the reference, then it is perfectly possible for BIB to outperform mSWEEP since BIB uses a model that is designed to handle precisely this situation.

mSWEEP is instead designed to solve cases where the real genome is not available in the set of reference sequences but genomes of related organisms from the same lineage are. This means that for mSWEEP to perform similarly to BIB in a mock community analysis, mSWEEP would likely require the addition of more reference sequences in order to enable the model to differentiate between false and true positives. Regardless, it probably makes more sense to use BIB for mock community analyses.

If I remember right, S. epidermidis is also a somewhat difficult species for mSWEEP to handle, since the STs don't form lineages as strong as those seen in some other species. That said, I haven't tried running the analysis on a reference set as large as yours seems to be. If you have the time and willingness to tinker with the programs, it might also be worth clustering the reference sequences with PopPUNK and using the resulting grouping (which should be roughly analogous to clonal complexes) as input to mSWEEP, rather than the ST-level grouping.

As for the error you're seeing when running mSWEEP, it means that your indicators file (the argument supplied with the -i option) has more lines than the .fasta file that was pseudoaligned against has genomes/contigs (i.e. header lines starting with '>', each followed by the nucleotide sequence). If you run grep "^>" reference_sequences.fasta | wc -l and wc -l indicators_file.txt, the two numbers should be the same; otherwise mSWEEP won't run, since the results would be nonsense.
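
The same check can be sketched in Python if that is more convenient than grep and wc. This is only an illustration: the function names are made up, the file names are placeholders passed on the command line, and every line of the indicators file is counted, which mirrors wc -l for a file that ends with a newline.

```python
import sys


def count_fasta_records(lines):
    """Count FASTA header lines, i.e. lines starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def compare_counts(fasta_lines, indicator_lines):
    """Return (number of sequences, number of indicator lines, whether they match)."""
    n_sequences = count_fasta_records(fasta_lines)
    n_indicators = sum(1 for _ in indicator_lines)
    return n_sequences, n_indicators, n_sequences == n_indicators


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_counts.py reference_sequences.fasta indicators_file.txt
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as indicators:
        n_seq, n_ind, ok = compare_counts(fasta, indicators)
    print(f"sequences: {n_seq}  indicators: {n_ind}")
    print("counts match" if ok else "counts differ")
```

If the counts differ, the index was most likely built from a different set of sequences than the one the grouping describes.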

Best, Tommi