GELOG / adam-ibs

Ports the IBS/MDS/IBD functionality of Plink to Spark / ADAM
Apache License 2.0

import text dataset (--make-bed) into Spark #2

Closed davidonlaptop closed 8 years ago

davidonlaptop commented 9 years ago

Description

Similar to plink --make-bed option. See the wiki on IBS-MDS Process.

The input files are PED and MAP. However, a relational model similar to the FAM - BED - BIM class diagram is better, and should be used internally.
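To make the input side concrete, here is a minimal Python sketch of reading one MAP line and one PED line into separate sample, variant, and genotype records. The class and function names (`Fam`, `MapEntry`, `GenotypeCall`, `parse_map_line`, `parse_ped_line`) are hypothetical and are not taken from the project's code; the sketch assumes the standard whitespace-delimited 4-column MAP and 6+2N-column PED layouts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fam:
    """Sample information: the first six columns of a PED line."""
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str


@dataclass(frozen=True)
class MapEntry:
    """Variant information: one line of a 4-column MAP file."""
    chromosome: str
    variant_id: str
    genetic_distance: float
    position: int


@dataclass(frozen=True)
class GenotypeCall:
    """One sample's two alleles at one variant."""
    family_id: str
    individual_id: str
    variant_id: str
    alleles: Tuple[str, str]


def parse_map_line(line: str) -> MapEntry:
    chrom, variant_id, distance, position = line.split()
    return MapEntry(chrom, variant_id, float(distance), int(position))


def parse_ped_line(line: str, variants: List[MapEntry]) -> Tuple[Fam, List[GenotypeCall]]:
    fields = line.split()
    fam = Fam(*fields[:6])
    calls = fields[6:]
    # A PED line carries two allele columns per variant, in MAP order.
    if len(calls) != 2 * len(variants):
        raise ValueError("PED line does not match the number of MAP variants")
    genotypes = [
        GenotypeCall(fam.family_id, fam.individual_id, v.variant_id,
                     (calls[2 * i], calls[2 * i + 1]))
        for i, v in enumerate(variants)
    ]
    return fam, genotypes
```

Keeping samples, variants, and genotype calls in separate records mirrors the FAM - BED - BIM split, and each record type could later be held in its own distributed collection.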

Analysis

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLINK formats (where it is incomplete), and add a class diagram describing the Scala models implemented for this feature to the wiki page on the MGL804 formats.

Implementation

The implementation should use:

Important note: The model can be in-memory only for now, but you'll need to integrate it into the ADAM format later on. The relevant records from the ADAM model are Variant and Genotypes, but some fields are missing and will need to be added.

iki-v commented 9 years ago

The implementation is almost finished.

Blocking point: The documentation states: "The file test.bim is the extended map file, which also includes the names of the alleles". The two additional fields are Reference Allele and Alternate Allele. Question: Where does this information (the two fields) come from, and how is it determined?

davidonlaptop commented 9 years ago

@ikizema: Next time, please tag me if the question is for me. That way, I think I will receive a distinct notification and see your question more quickly.

I'll answer by giving pointers to the diagrams at https://github.com/GELOG/adam-ibs/wiki/Plink-File-Formats.

An allele is represented by a letter (A, C, T, G) at a specific position on a chromosome. Since chromosomes come in pairs (one from the father, one from the mother), there are 2 alleles for a specific position.

Since the scientific community needed to agree on a reference allele, only one allele was arbitrarily chosen from the reference genome.

When sequencing a patient, however, we need both alleles (allele 1 from the mother, allele 2 from the father, or vice versa) to determine the genotype of the person. This is stored in the BED record. If one allele of the patient differs from the reference allele, it is called a variation (or variant), and the non-reference allele is called the alternate allele. This variation can be shared by many individuals, which is why it is stored in the BIM record.
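As a small illustration of the paragraph above, the following Python sketch classifies one person's pair of alleles against a given reference allele. The function name `classify_genotype` is hypothetical and only serves the example.

```python
def classify_genotype(allele1: str, allele2: str, reference: str) -> str:
    """Describe a two-allele genotype relative to the reference allele."""
    if allele1 == reference and allele2 == reference:
        return "homozygous reference"   # no variation at this position
    if allele1 == allele2:
        return "homozygous alternate"   # both copies carry the same non-reference allele
    return "heterozygous"               # the two copies differ from each other
```

For example, with reference allele `A`, the pair `A`/`G` is heterozygous, and `G` is the alternate allele.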

Does this answer the question?

iki-v commented 9 years ago

Hi @davidonlaptop, thank you for the answer.

The question was also for anyone who could have an answer.

I understand the concept of reference and alternate alleles in general. Looking at the "Genotype & Variant formats" diagrams for text and binary mode, some extra data is added in binary mode: the reference and alternate entries in the BIM records. My concern was to understand why this extra information appears and how it is obtained.

davidonlaptop commented 9 years ago

I don't know why they are not present in the MAP file. PLINK probably retrieves them from the variant identifier, usually a string in the format "rs######" where "#" is a digit. This ID uniquely identifies a variant (and its allele information) and can be found in many online databases. I suppose that PLINK uses a similar database somewhere in the code to add these columns to the BIM file.

iki-v commented 9 years ago

Data format information: BED format, BIM format, FAM format

The 5th column in the .bim file (Reference Allele) is A1, usually the minor allele. The 6th column in the .bim file (Alternate Allele) is A2, usually the major allele.
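One way to obtain A1 and A2 without any external database is to count the alleles observed across all samples at a variant. The Python sketch below shows that idea under stated assumptions: the function name `minor_major` is hypothetical, `0` is treated as the missing-allele code as in PED files, ties are broken alphabetically (an arbitrary choice made for this example), and a variant with a single observed allele gets `0` as its A1 placeholder.

```python
from collections import Counter
from typing import Iterable, Tuple

MISSING = "0"  # missing-allele code used in PED files


def minor_major(genotypes: Iterable[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (A1, A2) = (minor allele, major allele) for one variant."""
    counts = Counter(a for pair in genotypes for a in pair if a != MISSING)
    if len(counts) > 2:
        raise ValueError("more than two distinct alleles at this variant")
    # Most frequent allele first; alphabetical order breaks ties (assumption).
    ranked = [a for a, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    major = ranked[0] if ranked else MISSING
    minor = ranked[1] if len(ranked) > 1 else MISSING
    return minor, major
```

Because the counts are simple sums, the same computation could be expressed as a per-variant aggregation over genotype records in a distributed setting.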

iki-v commented 9 years ago

GitHub is updated. The BED, BIM and FAM records are computed. To do: save the information to .bed, .bim and .fam files. This should be done in one standard way as part of task #41.
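For the saving step, here is a minimal Python sketch of how one variant's genotypes could be packed into the PLINK 1 variant-major .bed layout: three header bytes, then two bits per sample, four samples per byte, with the first sample in the lowest-order bits. The helper names (`genotype_code`, `pack_variant`) are hypothetical; this is a sketch of the file layout, not of the project's writer.

```python
from typing import Sequence, Tuple

# Header of a variant-major PLINK 1 .bed file.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit genotype codes used in the .bed payload.
HOM_A1, MISSING_CODE, HET, HOM_A2 = 0b00, 0b01, 0b10, 0b11


def genotype_code(pair: Tuple[str, str], a1: str, a2: str) -> int:
    """Map one sample's allele pair to its two-bit .bed code."""
    if "0" in pair:                 # "0" is the missing-allele code in PED
        return MISSING_CODE
    if pair == (a1, a1):
        return HOM_A1
    if pair == (a2, a2):
        return HOM_A2
    if set(pair) == {a1, a2}:
        return HET
    return MISSING_CODE             # unexpected allele: treated as missing here


def pack_variant(pairs: Sequence[Tuple[str, str]], a1: str, a2: str) -> bytes:
    """Pack all samples of one variant, four samples per byte."""
    out = bytearray((len(pairs) + 3) // 4)
    for i, pair in enumerate(pairs):
        out[i // 4] |= genotype_code(pair, a1, a2) << (2 * (i % 4))
    return bytes(out)
```

A complete file would be `BED_MAGIC` followed by the packed block of each variant, in the same order as the lines of the .bim file.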

iki-v commented 9 years ago

The implementation needs to be modified to map to the ADAM format. Prerequisite: completion of #43.