Column name handling: start/end

bschilder commented 2 years ago

Currently, we don't have start/end in the mapping file. This isn't a problem for 1bp-wide features like SNPs (in the strict sense of the term), but if you include indels/structural variants you can have both start/end columns spanning some range.

Might we want to come up with some way to rename these so that they're compatible with the full pipeline? I.e. we probably want to keep "start" as "BP", since this is a cornerstone of how MSS works. But perhaps when synonyms of "end" occur, we can rename that to something like "BP2".
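
As a rough illustration of the renaming idea, here is a minimal sketch in Python (MSS itself is an R package, so this is not the package's code). The `SYNONYMS` table and both function names are hypothetical, and only a handful of the synonyms suggested in this comment are included. Headers and synonyms are both upper-cased before matching, so case does not matter.

```python
# Hypothetical synonym table: canonical column name -> accepted synonyms.
# Only a subset of the suggested synonyms is shown here.
SYNONYMS = {
    "BP": ["pos1", "start", "bp start", "begin"],
    "BP2": ["pos2", "end", "bp end"],
}


def build_lookup(synonyms):
    """Return a dict mapping each upper-cased synonym to its canonical name."""
    lookup = {}
    for canonical, names in synonyms.items():
        # The canonical name maps to itself.
        lookup[canonical.upper()] = canonical
        for name in names:
            lookup[name.upper()] = canonical
    return lookup


def standardise_headers(headers, lookup):
    """Rename recognised headers; leave unrecognised ones unchanged."""
    return [lookup.get(h.strip().upper(), h) for h in headers]


LOOKUP = build_lookup(SYNONYMS)
print(standardise_headers(["SNP", "Start", "bp end", "P"], LOOKUP))
```

Unrecognised headers pass through untouched, so a file without any "end"-like column would simply never gain a "BP2" column.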

Mappings that come to mind:

--> "BP"

"pos1"
"position 1"
"position1"
"start"
"start position"
"position start"
"start pos"
"pos start"
"bp1"
"bp start"
"start bp"
"begin"

--> "BP2"

"pos2"
"position 2"
"position2"
"end"
"end position"
"position end"
"end pos"
"pos end"
"bp2"
"bp end"
"end bp"

bschilder commented 2 years ago

Oh also, a few other minor edits to the CHR mapping that was updated recently, adding:

--> "CHR"

"seqs"
"seqname"
"CHROMS"

Al-Murphy commented 2 years ago

This makes sense, we should probably only do it if the Indel parameter is set to true? Next time one of us is making changes to the dev branch it will probably be worth testing the effect of this. Do you know of any downstream software that uses start & end? What names do they require? Just want to make sure BP2 will be understandable to users and downstream applications.

Al-Murphy commented 2 years ago

Yep, happy to add the extra CHR mappings, but put them in as upper case since all inputted headers are passed through toupper() anyway (this doesn't really matter since the same is done to the entries in the mapping file, but just so they match what's there currently).

bschilder commented 2 years ago

> This makes sense, we should probably only do it if the Indel parameter is set to true?

I think this makes sense, given that MSS currently only covers SNPs/indels (not larger SVs).

SVs are a bit more complicated, but you can run a GWAS with them very similarly to the way you would run one with SNPs only. My former labmate did this in AD, so if we eventually decide to go in that direction it would be great to get his input. @ricardovialle would you mind giving some initial thoughts on this?

> Next time one of us is making changes to the dev branch it will probably be worth testing the effect of this.

Definitely!

> Do you know of any downstream software that uses start & end? What names do they require? Just want to make sure BP2 will be understandable to users and downstream applications.

That's a good point, I picked BP2 for brevity but the typical nomenclature is "start"/"end" when dealing with ranged data (based on GenomicRanges). So maybe "BPEND" would be closer? That seems less obvious though when it's all uppercase.

That said, anyone who wants to do downstream analysis with ranged data is likely going to convert to GRanges anyway, which fortunately is already an export option for MSS. So as long as MSS has a way of converting back and forth between data.table and GRanges format, that should cover most use cases. One other nice thing would be to add BED as one of the write formats. That way, it can automatically be read in as a GRanges object by tools like rtracklayer::import.
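
One detail worth keeping in mind for a BED writer is the coordinate convention: BED uses a 0-based start and an exclusive end, whereas summary statistics positions are usually 1-based. Below is a minimal Python sketch of that conversion (not MSS code). The function name `to_bed_line` is hypothetical, "BP2" is only the column name proposed above, and the sketch assumes BP/BP2 are 1-based inclusive coordinates.

```python
def to_bed_line(chrom, bp, bp2=None, name="."):
    """Format one variant as a 4-column BED line.

    Assumes bp/bp2 are 1-based inclusive positions (an assumption about
    the input). BED starts are 0-based and BED ends are exclusive, so
    only the start is shifted by one.
    """
    # A single-bp feature such as a SNP has no separate end position.
    end = bp if bp2 is None else bp2
    if end < bp:
        raise ValueError("end position is before start position")
    return f"{chrom}\t{bp - 1}\t{end}\t{name}"


# A SNP at position 100 and an indel spanning positions 100-105.
print(to_bed_line("chr1", 100))
print(to_bed_line("chr2", 100, 105, "rs1"))
```

With this convention a 1bp SNP becomes an interval of length one in BED, so SNPs and ranged variants can share the same file.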

Regarding analysis software, I suppose it depends on how you want to use your sumstats. One that comes to mind is goshifter. It takes a set of ranged annotations and tests for enrichment against a list of non-ranged SNPs. In theory, you could input ranged annotations derived from a GWAS with SNPs/indels and run enrichment against a list of SNPs from another GWAS that only has single-bp SNPs (or some other source).

> Yep, happy to add the extra CHR mappings, but put them in as upper case since all inputted headers are passed through toupper() anyway (this doesn't really matter since the same is done to the entries in the mapping file, but just so they match what's there currently).

Makes sense!

Al-Murphy commented 2 years ago

I actually had a few other concerns around checks on this with the reference datasets. Probably worth discussing in person before we commit to adding the functionality (and maybe including Nathan too)

ricardovialle commented 2 years ago

> SVs are a bit more complicated, but you can run a GWAS with them very similarly to the way you would run one with SNPs only. My former labmate did this in AD, so if we eventually decide to go in that direction it would be great to get his input. @ricardovialle would you mind giving some initial thoughts on this?

Hi guys, this would definitely be something nice to have. I'm not aware of a specific standard for reporting SV summary stats. Even the VCFs created by SV callers sometimes do not follow the same specification. The VCF format specification usually expects an END field. In our results, we reported pos and sv_end (link). Also, you might want to consider CIPOS and CIEND, as many SVs have imprecise breakpoints.
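
To make the END/CIPOS/CIEND point concrete, here is a small Python sketch that pulls those fields out of a VCF INFO string and turns the confidence intervals into absolute coordinate ranges. The function names and the returned dictionary layout are hypothetical, and the INFO string in the example is made up for illustration. It assumes END is an integer and that CIPOS/CIEND are "low,high" offsets relative to POS and END respectively.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def _offsets(value):
    """Parse a 'low,high' offset pair; a missing field means no uncertainty."""
    if value is None or value is True:
        return (0, 0)
    low, high = value.split(",")
    return (int(low), int(high))


def sv_interval(pos, info):
    """Return start/end plus the breakpoint ranges implied by CIPOS/CIEND."""
    fields = parse_info(info)
    # Without an END field, treat the record as a single-position variant.
    end = int(fields.get("END", pos))
    cipos = _offsets(fields.get("CIPOS"))
    ciend = _offsets(fields.get("CIEND"))
    return {
        "start": pos,
        "end": end,
        "start_range": (pos + cipos[0], pos + cipos[1]),
        "end_range": (end + ciend[0], end + ciend[1]),
    }


# Made-up example record: an imprecise deletion.
print(sv_interval(1000, "SVTYPE=DEL;END=1500;CIPOS=-10,10;CIEND=-5,20;IMPRECISE"))
```

Records without CIPOS/CIEND fall back to zero-width ranges, so precise and imprecise variants end up with the same set of columns.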

bschilder commented 2 years ago

Thanks so much for the input @ricardovialle! Alan and I will chat about all this in our next meeting and we'll let you know the plan here.

Al-Murphy commented 2 years ago

Added extra CHR mappings

Al-Murphy / MungeSumstats