dms-vep / dms-vep-pipeline-3

Pipeline for analyzing deep mutational scanning (DMS) of viral entry proteins (VEPs)

summary.csv filtering #113

Closed Bernadetadad closed 7 months ago

Bernadetadad commented 8 months ago

I thought summary.csv was supposed to contain data filtered according to the params set in summaries_config.yml. But, for example, here a summary.csv file from the flu repo still has some sera selection values for mutations with a functional score lower than -3 in 293T cells (even though those mutations are greyed out in the plots). This is the case for summaries in other repos as well. Am I misunderstanding what's supposed to be in the summary.csv file?

jbloom commented 7 months ago

@Bernadetadad, the output CSVs for the summaries (note there is also a per-sera output) have the times_seen and min_frac_models filters applied, as well as any other filters in the le_filters specification. These are intended as filters on data quality, and are applied to all data before writing the CSVs and making any plots. At least, that is what should happen, and from looking at the code I think it does; raise an issue if not.

But the CSVs do not have things like the init_min_value sliders for various properties applied. Those sliders set what is shown by default, but do not filter what is in the CSV, since the plot can be re-adjusted to show the hidden values.

So right now there is a distinction between hard filters applied to the data before it even goes into the plot, and the sliders.
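To make the distinction concrete, here is a minimal pandas sketch. It is not the pipeline's actual code: the column names (`mutation`, `times_seen`, `functional_effect`, `escape`), the helper functions, and the thresholds are all hypothetical, chosen only to mirror the behavior described above. The hard filter removes rows before anything is written, while the slider only computes which of the remaining rows are shown by default.

```python
import pandas as pd

# Toy data with hypothetical column names; not the real summary.csv schema.
df = pd.DataFrame(
    {
        "mutation": ["A1C", "A1D", "A1E", "A1F"],
        "times_seen": [5, 1, 4, 6],
        "functional_effect": [-0.5, -0.2, -4.0, -1.0],
        "escape": [0.1, 0.9, 1.2, 0.3],
    }
)


def apply_hard_filters(data, min_times_seen):
    """Data-quality filter: failing rows are dropped and never reach the CSV."""
    return data[data["times_seen"] >= min_times_seen].reset_index(drop=True)


def default_visible(data, init_min_value):
    """Slider-style filter: only flags which rows are shown by default."""
    return data["functional_effect"] >= init_min_value


# Rows that would be written to the CSV (A1D is removed for low times_seen).
csv_rows = apply_hard_filters(df, min_times_seen=2)

# Rows shown by default in the plot (A1E stays in the CSV but is greyed out).
shown = default_visible(csv_rows, init_min_value=-3)
```

In this toy example `A1E` is still present in `csv_rows` with its escape value, even though the default slider setting hides it, which matches what was observed in summary.csv.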

It may be possible to apply filters like the functional score one as hard filters using le_filters. Do you want me to look into that and post an example?

Note also (possibly related) that I have an issue open to create more than one summary plot so you can have multiple differently configured ones. If that is a priority that would help you, make a comment in that issue and I will try to prioritize it.

jbloom commented 7 months ago

I'm going to close this issue. I think the current behavior is correct, and not really a bug. That is because there is a conceptual difference between filters that remove low-quality measurements, and mutations that are deleterious to cell entry. We show them differently on the plots: the former are shown as missing, the latter are grayed out.

If you want to remove them from the CSV, I think the way to do that is to write a separate rule.
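As a rough sketch of what the body of such a rule could do, the function below drops rows whose functional effect is below a cutoff. It is an assumption-laden illustration, not part of the pipeline: the function name `drop_deleterious`, the column name `entry in 293T cells`, and the cutoff are hypothetical and would need to match the actual summary.csv columns.

```python
import pandas as pd


def drop_deleterious(summary, effect_col, min_effect):
    """Return only rows whose functional effect is >= min_effect.

    Rows with a missing value in effect_col are also dropped, because
    a comparison against NaN evaluates to False.
    """
    keep = summary[effect_col] >= min_effect
    return summary[keep].reset_index(drop=True)


# Toy stand-in for a summary CSV; the column names are hypothetical.
summary = pd.DataFrame(
    {
        "mutation": ["K10R", "K10E", "K10P"],
        "entry in 293T cells": [-0.3, -3.5, float("nan")],
        "sera escape": [0.2, 1.1, 0.4],
    }
)

filtered = drop_deleterious(summary, "entry in 293T cells", min_effect=-3)
```

On a real file this would be wrapped as `drop_deleterious(pd.read_csv(path), ...).to_csv(out_path, index=False)`, with `path` and `out_path` supplied by the rule. Whether to keep rows with missing functional effects is a design choice; this sketch drops them.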

Re-open and explain if you disagree, @Bernadetadad