diskin-lab-chop / AutoGVP


Filter multianno, autopvs1, intervar files before loading in R #143

Closed rjcorb closed 1 year ago

rjcorb commented 1 year ago

Purpose/implementation Section

What feature is being added or bug is being addressed?

Closes #142. This PR adds code to filter all annotation files (multianno, autopvs1, intervar) down to the positions contained in the filtered VCF. These filtered files are then used as input for 01-annotate_variants_CAVATICA_input.R and 01-annotate_variants_custom_input.R.

What was your approach?

What GitHub issue does your pull request address?

#142

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please review the updated code logic and run it on the test PBTA and test custom files to ensure the output looks as expected.

Is there anything that you want to discuss further?

Note that, since we are only grepping the annotation files for positions in the filtered VCFs, we are not filtering for the variants themselves per se. There are certainly some cases where rows are retained because a non-position column in the annotation file matches a position in the filtered VCF file. Still, this approach is effective: it reduces the overall memory usage of AutoGVP by ~40%. I am open to other suggestions on how to improve this filtering!
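To make the trade-off above concrete, here is a minimal Python sketch (not the code in this PR) that contrasts a grep-style substring match with a match on the chromosome and position columns. The column names `Chr` and `Start`, the toy rows, and the function names are all assumptions for illustration. It also assumes chromosome names are spelled the same way in the VCF and the annotation file.

```python
import csv
import io

# Toy inputs (hypothetical): one filtered VCF record, three annotation rows.
VCF = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t12345\t.\tA\tG\n"
ANNOT = (
    "Chr\tStart\tEnd\tRef\tAlt\tOtherinfo\n"
    "chr1\t12345\t12345\tA\tG\tx\n"
    "chr2\t99999\t99999\tC\tT\tid_12345\n"  # position string appears in a non-position column
    "chr3\t500\t500\tG\tA\ty\n"
)

def load_vcf_positions(vcf_text):
    """Collect (chrom, pos) pairs from the data lines of a VCF."""
    positions = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos = line.split("\t")[:2]
        positions.add((chrom, pos))
    return positions

def filter_by_substring(annot_text, positions):
    """Mimic a plain grep: keep a row if a position string occurs anywhere in it."""
    wanted = {pos for _, pos in positions}
    lines = annot_text.splitlines()
    header, rows = lines[0], lines[1:]
    return [header] + [r for r in rows if any(p in r for p in wanted)]

def filter_by_columns(annot_text, positions, chrom_col="Chr", pos_col="Start"):
    """Keep a row only if its (chromosome, start) pair is in the filtered VCF."""
    reader = csv.DictReader(io.StringIO(annot_text), delimiter="\t")
    return [row for row in reader if (row[chrom_col], row[pos_col]) in positions]

positions = load_vcf_positions(VCF)
grep_like = filter_by_substring(ANNOT, positions)
column_aware = filter_by_columns(ANNOT, positions)
print(len(grep_like) - 1, "rows kept by substring match")
print(len(column_aware), "rows kept by column match")
```

On this toy input the substring match keeps the `chr2` row as well, because `12345` appears in its last column, while the column match keeps only the `chr1` row. Whether the stricter match is worth the extra parsing cost on full-size files is a separate question that this sketch does not answer.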

rjcorb commented 1 year ago

I think anchoring the grep pattern with ^ would require the position to be at the start of the line. Since the Chr column comes first, that wouldn't successfully filter in this case.
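The anchoring point can be checked with a small Python sketch using the `re` module. The rows and helper names are hypothetical, and the second pattern, which skips the first tab-delimited field before matching the position, is only one possible alternative rather than something proposed in this thread.

```python
import re

def starts_with_position(row, pos):
    """Pattern anchored with ^ directly on the position."""
    return re.search(rf"^{re.escape(pos)}\b", row) is not None

def position_in_second_column(row, pos):
    """Skip the first tab-delimited field (Chr), then require the position in field two."""
    return re.search(rf"^[^\t]*\t{re.escape(pos)}\t", row) is not None

# Hypothetical annotation rows with Chr as the first column.
true_hit = "chr1\t12345\t12345\tA\tG\t."
false_hit = "chr2\t99999\t99999\tC\tT\tid_12345"

print(starts_with_position(true_hit, "12345"))        # False: the line starts with "chr1"
print(position_in_second_column(true_hit, "12345"))   # True: second field is the position
print(position_in_second_column(false_hit, "12345"))  # False: match is only in a later column
```

This version still ignores the chromosome, so the same position on a different chromosome would be kept.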