NVIDIA / VariantWorks

Deep Learning based variant calling toolkit - https://clara-parabricks.github.io/VariantWorks/
Apache License 2.0

Feature request: BEDPE format for ingesting SV data #144

Closed: mp15 closed this issue 3 years ago

mp15 commented 3 years ago

Hi,

I am working on a structural variant breakpoint cluster classifier for cancer samples and was wondering whether you had a BEDPE format parser in the offing or whether you planned to write one in the near future?

Martin

tijyojwad commented 3 years ago

Hi Martin!

We don't have one yet, but we can certainly add a BEDPE parser. Do you have a specific format in mind in which you'd like it exposed? Are you essentially envisioning an encoder based on the BEDPE entry instead of a Variant entry that you can use to build the encoding?

mp15 commented 3 years ago

Yes please. This is the format, and I also have an example. I want to be able to take breakpoints from the BEDPE file and then classify either them or groups of them (I'm planning to feed groups of breakpoints to an RNN).
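To make the request concrete, here is a minimal sketch of reading breakpoint pairs from a BEDPE file using only the Python standard library. It assumes the ten-column layout described in the bedtools documentation (chrom1, start1, end1, chrom2, start2, end2, name, score, strand1, strand2) and tab-separated fields. The names `BreakpointPair` and `parse_bedpe` are hypothetical and are not part of VariantWorks; the sample record is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class BreakpointPair:
    """One BEDPE record: two genomic intervals plus name, score and strands."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    score: str      # kept as text because "." is a common placeholder
    strand1: str
    strand2: str
    extra: List[str] = field(default_factory=list)  # caller-specific columns


def parse_bedpe(lines: Iterable[str]) -> Iterator[BreakpointPair]:
    """Yield one BreakpointPair per data line, skipping blanks and '#' lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: all ten standard columns are present in every record.
        if len(fields) < 10:
            raise ValueError(f"expected at least 10 columns, got {len(fields)}")
        yield BreakpointPair(
            chrom1=fields[0], start1=int(fields[1]), end1=int(fields[2]),
            chrom2=fields[3], start2=int(fields[4]), end2=int(fields[5]),
            name=fields[6], score=fields[7],
            strand1=fields[8], strand2=fields[9],
            extra=fields[10:],
        )


# Made-up example record, not taken from any real pipeline output.
sample = ["# chr1\tstart1\tend1\tchr2\tstart2\tend2\tid\tscore\tstrand1\tstrand2\n",
          "1\t100\t200\t2\t300\t400\tsv1\t42\t+\t-\n"]
for pair in parse_bedpe(sample):
    print(pair.chrom1, pair.start1, pair.chrom2, pair.start2)
```

Each record could then be encoded on its own, or records could be grouped before being passed to a sequence model.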

tijyojwad commented 3 years ago

Fantastic, the example is very helpful. Looks quite straightforward, we can have a PR up within a few hours!

tijyojwad commented 3 years ago

Hi Martin, I've created PR #145 with a generic BED parser, and used your sample data as a test case :). The API usage is shown here - https://github.com/clara-parabricks/VariantWorks/pull/145/files#diff-1beff6bb5395f5d2aa83d24579f5fb764dd0091a4bab06ee9fbba585e6f9a442R24

Can you have a look to see if this would fit your needs?

mp15 commented 3 years ago

Thanks.

Sorry, I was dealing with my PhD first-year viva. I have tested this and have some notes. #152 contains a fix for a trivial bug I spotted in the strong typing code.

Unfortunately, the data produced by our BRASS pipeline uses a slightly different header format and duplicates two of the header labels. I found an example of this at https://github.com/cancerit/BRASS/blob/dev/perl/testData/BrassMarkedGroups_test.out.bedpe and have proposed a patch in #153.

tijyojwad commented 3 years ago

Hi @mp15 - looking through the column names in the example bedpe, I don't see a duplicated column name. Was that the right link?

mp15 commented 3 years ago

Here is a better sample (attached as tmp.txt). This is an actual header from one of our pipelines, edited to remove sensitive bits. As you can see, strand1 and strand2 each appear twice, eww.
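One way a parser could cope with repeated header labels is to rename the later occurrences so that every column stays addressable by name. The sketch below is only an illustration of that idea and is not necessarily what #153 does; the function name `make_unique` and the `_2` suffix scheme are assumptions, and the header shown is a shortened, made-up example rather than the real BRASS header.

```python
from collections import Counter
from typing import List


def make_unique(labels: List[str]) -> List[str]:
    """Return labels with repeats renamed: second 'x' becomes 'x_2', and so on.

    Limitation: this does not check whether a generated name such as
    'x_2' collides with a label that is already present in the header.
    """
    seen: Counter = Counter()
    unique = []
    for label in labels:
        seen[label] += 1
        if seen[label] == 1:
            unique.append(label)          # first occurrence keeps its name
        else:
            unique.append(f"{label}_{seen[label]}")
    return unique


# Made-up header with strand1 and strand2 repeated.
header = ["chr1", "start1", "end1", "chr2", "start2", "end2",
          "id", "score", "strand1", "strand2", "strand1", "strand2"]
print(make_unique(header))
```

Keeping the first occurrence unchanged means code that already looks up `strand1` and `strand2` keeps working, while the repeated columns remain available under the suffixed names.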

mp15 commented 3 years ago

Resolved by #153