Motivation
The Genomic Data Structure (GDS) is a space-efficient file format for storing variant
information, with many of the same benefits as VCF. These code changes implement
a GDS converter module.
Changes
Changes to converters
The converters folder in open-cravat-modules-karchinlab now contains an additional
folder, gds-converter, which holds the GDS converter module along with the other
expected files. The GDS converter outputs variant information in the expected
format. Its general structure is as follows. The Python script starts an R session
through the rpy2 library and passes the input GDS file to it. The R side then reads
the file with SeqArray, an R library built to read GDS files efficiently. The relevant
variant information is accessed, parsed, and returned to the main Python script,
which reformats it into the expected format and yields the result.
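The flow described above can be sketched roughly as follows. This is a minimal illustration and not the module's actual code: the function names `read_gds_variants` and `to_variant_records` are hypothetical, the output keys (`chrom`, `pos`, `ref_base`, `alt_base`) are an assumption about the expected format, and the sketch assumes that SeqArray's `allele` field holds comma-separated strings with the reference allele first.

```python
from typing import Dict, Iterable, Iterator


def read_gds_variants(gds_path: str) -> Dict[str, list]:
    """Read per-variant columns from a GDS file using SeqArray through rpy2."""
    # Imported lazily so the pure-Python reformatting below can be used
    # (and tested) on machines without R or rpy2 installed.
    from rpy2.robjects.packages import importr

    seqarray = importr("SeqArray")
    gds = seqarray.seqOpen(gds_path)
    try:
        return {
            "chromosome": list(seqarray.seqGetData(gds, "chromosome")),
            "position": list(seqarray.seqGetData(gds, "position")),
            "allele": list(seqarray.seqGetData(gds, "allele")),
        }
    finally:
        # Always release the file handle, even if a read fails.
        seqarray.seqClose(gds)


def to_variant_records(
    chroms: Iterable[str],
    positions: Iterable[int],
    alleles: Iterable[str],
) -> Iterator[dict]:
    """Yield one record per alternate allele (output keys are assumed)."""
    for chrom, pos, allele in zip(chroms, positions, alleles):
        # Assumed layout: "REF,ALT1,ALT2,..." with the reference allele first.
        ref, *alts = allele.split(",")
        for alt in alts:
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref_base": ref,
                "alt_base": alt,
            }
```

Keeping the R call and the reformatting in separate functions means the reformatting step can be exercised with plain Python lists, for example `list(to_variant_records(["1"], [100], ["A,G"]))`, without an R installation.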
Relevance
Use of R, rpy2, and SeqArray
These additional resources were adopted because of several technical difficulties:
GDS has no publicly available, documented file specification
Other, more streamlined Python-based resources (pygds) are outdated
A low-level C-based implementation of this converter is possible but would be unmaintainable