aitgon / vtam


Make automatically a complete known_occurrences.tsv in filter #1

Closed meglecz closed 2 years ago

meglecz commented 4 years ago

When using several different mocks and negative controls, plus samples from different habitats, preparing known_occurrences.tsv is tedious.

It would be nice to prepare a known_occurrences.tsv file automatically when running filter. The user could revise this file manually afterwards, but it would serve as a solid base.

Plan: run filter with two options:

--mock_composition: requires a TSV file in the format of known_occurrences.tsv, but listing only the keep occurrences, for all mocks.

--sample_types: requires a TSV file as in the example below. All sample-run-marker combinations should be listed.

Marker  Run   Sample        Sample_type  habitat
MFZR    run1  tpos1_run1    mock         terrestrial
MFZR    run1  tnegtag_run1  negatif      NA
MFZR    run1  14ben01       real         freshwater
MFZR    run1  14ben02       real         freshwater
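As a rough illustration of how the proposed sample_types file could be consumed, here is a minimal Python sketch that reads the example rows above and groups samples by type. The helper names (`read_sample_types`, `samples_by_type`) are hypothetical and are not part of vtam; the column names are taken from the example, and the check for duplicate sample-run-marker combinations is an assumption about how the file should be validated.

```python
import csv
import io
from collections import defaultdict

# Example rows from this issue, written as tab-separated text.
SAMPLE_TYPES_TSV = (
    "Marker\tRun\tSample\tSample_type\thabitat\n"
    "MFZR\trun1\ttpos1_run1\tmock\tterrestrial\n"
    "MFZR\trun1\ttnegtag_run1\tnegatif\tNA\n"
    "MFZR\trun1\t14ben01\treal\tfreshwater\n"
    "MFZR\trun1\t14ben02\treal\tfreshwater\n"
)


def read_sample_types(handle):
    """Hypothetical helper: map (marker, run, sample) -> (sample_type, habitat)."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_types = {}
    for row in reader:
        key = (row["Marker"], row["Run"], row["Sample"])
        # Assumption: each sample-run-marker combination is listed exactly once.
        if key in sample_types:
            raise ValueError(f"duplicate sample-run-marker combination: {key}")
        sample_types[key] = (row["Sample_type"], row["habitat"])
    return sample_types


def samples_by_type(sample_types):
    """Group sample names by their Sample_type value, in file order."""
    by_type = defaultdict(list)
    for (_marker, _run, sample), (sample_type, _habitat) in sample_types.items():
        by_type[sample_type].append(sample)
    return dict(by_type)


sample_types = read_sample_types(io.StringIO(SAMPLE_TYPES_TSV))
by_type = samples_by_type(sample_types)
print(by_type["real"])  # ['14ben01', '14ben02']
```

In real use the handle would be an open file rather than an in-memory string; the string is only there to keep the sketch self-contained.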

Based on these files prepare a known_occurrences.tsv with keep and delete occurrences as follows:

Keep occurrences: Copy of mock_composition.tsv

Delete Occurrences:

aitgon commented 4 years ago

One question. For the "habitat" delete part, if a cutoff other than 0.5 must be used, do we need a parameter?

Another question. What does N_i' mean? The sum of all other variants in the same habitat h? The sum of variant i in the other habitats?

My other suggestions are the following. I propose that this tool writes a "delete occurrences" file without "keep occurrences". The "keep occurrences" file is a different file created by the user. The following "vtam optimize" command will take two options --occurrences_keep and --occurrences_delete.

meglecz commented 4 years ago

> One question. For the "habitat" delete part, if a cutoff other than 0.5 must be used, do we need a parameter?

Ideally yes. We can call it min_habitat_proportion or habitat_proportion or habitat_p.

> What does N_i' mean?

The total number of reads of variant i in the run-marker combination (N_i), minus the number of reads of variant i in samples where habitat is 'NA' (negative controls).
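To make the definition of N_i' concrete, here is a small Python sketch. The function name `adjusted_read_count` and the data layout (a dict keyed by sample and variant) are hypothetical, and the read counts are invented purely for illustration; only the sample names and habitats come from the example in this issue.

```python
def adjusted_read_count(read_counts, habitats, variant):
    """Return N_i' for one variant within a single run-marker combination.

    read_counts: dict mapping (sample, variant) -> number of reads
    habitats:    dict mapping sample -> habitat ('NA' for negative controls)
    """
    # N_i: all reads of the variant across the run-marker combination.
    n_i = sum(
        count for (_sample, var), count in read_counts.items() if var == variant
    )
    # Reads of the same variant in samples whose habitat is 'NA'.
    n_na = sum(
        count
        for (sample, var), count in read_counts.items()
        if var == variant and habitats[sample] == "NA"
    )
    return n_i - n_na


# Habitats follow the sample_types example; read counts are made up.
habitats = {
    "tpos1_run1": "terrestrial",
    "tnegtag_run1": "NA",
    "14ben01": "freshwater",
}
read_counts = {
    ("tpos1_run1", "v1"): 100,
    ("tnegtag_run1", "v1"): 5,
    ("14ben01", "v1"): 40,
    ("14ben01", "v2"): 7,
}

n_prime = adjusted_read_count(read_counts, habitats, "v1")
print(n_prime)  # 140, i.e. N_i = 145 minus 5 reads in the negative control
```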

> My other suggestions are the following. I propose that this tool writes a "delete occurrences" file without "keep occurrences". The "keep occurrences" file is a different file created by the user. The following "vtam optimize" command will take two options --occurrences_keep and --occurrences_delete.

That is a possibility, but generally you do not like adding new parameters. I would prefer to have keep and delete occurrences in the same file (known_occurrences.tsv), to keep the command as simple as possible.