DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Dataset coordinate system #27

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

In the gmqlSchema tag of a dataset schema XML file, add the attribute coordinateSystem (which can take the values 0-based or 1-based). Based on this attribute, handle dataset input and output in GMQL properly by translating (if needed) to/from the 0-based coordinate system used within GMQL. If no coordinateSystem attribute is specified in the dataset schema, use 0-based as the default.

Standard formats have their own predefined coordinate systems; see:

https://www.biostars.org/p/6373/
https://genome.ucsc.edu/FAQ/FAQformat.html#format13
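
To make the requested translation concrete, here is a minimal Python sketch. It is not GMQL's actual code: the function names are hypothetical, and it assumes that 0-based means half-open intervals (as in BED) while 1-based means closed intervals (as in GTF), so only the start coordinate shifts.

```python
from typing import Tuple

# The two values proposed for the coordinateSystem attribute.
COORDINATE_SYSTEMS = ("0-based", "1-based")


def _check(coordinate_system: str) -> None:
    """Reject anything other than the two proposed attribute values."""
    if coordinate_system not in COORDINATE_SYSTEMS:
        raise ValueError(f"unknown coordinate system: {coordinate_system!r}")


def to_internal(start: int, stop: int,
                coordinate_system: str = "0-based") -> Tuple[int, int]:
    """Translate dataset coordinates to the internal 0-based system.

    Assumption: 1-based input is a closed interval, so only start shifts.
    """
    _check(coordinate_system)
    if coordinate_system == "1-based":
        return start - 1, stop
    return start, stop


def from_internal(start: int, stop: int,
                  coordinate_system: str = "0-based") -> Tuple[int, int]:
    """Translate internal 0-based coordinates back for output."""
    _check(coordinate_system)
    if coordinate_system == "1-based":
        return start + 1, stop
    return start, stop
```

Under these assumptions a region read from a 1-based file and written back to a 1-based file comes out unchanged, and 0-based data pass through untouched.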

marcomass commented 7 years ago

In GMQL, update the default schemas (and their xsd) accordingly for the supported standard formats (narrow/broadpeak, bed, bedgraph, gtf, vcf), which are used when a user adds his/her own dataset to the repository. Also update the schemas of the public datasets in the repository (as well as the reference schema used in Jorge's software to update the repository).

marcomass commented 7 years ago

@OlgaGorlova Can you provide an example of how to specify the coordinateSystem attribute and value in the dataset schema?

OlgaGorlova commented 7 years ago

@marcomass You do not have to specify it - GTF and VCF formats by default are read as 1-based, others as 0-based
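
The per-format default described here could be sketched in Python as follows. This is only an illustration of the rule stated in the comment; the function name and the format identifiers are hypothetical, not taken from the GMQL code base.

```python
# Formats that, per the comment above, are read as 1-based by default.
ONE_BASED_FORMATS = {"gtf", "vcf"}


def default_coordinate_system(file_format: str) -> str:
    """Return the default coordinate system for a format name.

    GTF and VCF default to 1-based; every other format defaults to 0-based.
    """
    if file_format.strip().lower() in ONE_BASED_FORMATS:
        return "1-based"
    return "0-based"
```

For example, `default_coordinate_system("bed")` falls through to the 0-based branch, as would any user-defined tab-delimited format under this rule.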

marcomass commented 7 years ago

@OlgaGorlova Ok, good. But what about data in a generic tab-delimited format? Being user-defined, such data could use either of the two coordinate systems, so users need a way to declare in the schema which coordinate system the data use. Has this been enabled? If so (or once it is), please close this issue, which I reopened.

marcomass commented 7 years ago

@acanakoglu @OlgaGorlova Can you provide an example of how to specify the coordinateSystem attribute and value in the dataset schema?

marcomass commented 7 years ago

@OlgaGorlova @acanakoglu I reopened this issue because, for datasets in a non-standard data format (i.e., tab-delimited), the user needs to be able to specify in the dataset's XML schema which coordinate system the data use. Please implement it.
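
As an illustration of what is being requested, the Python sketch below parses a schema for a tab-delimited dataset and reads the coordinateSystem attribute, falling back to 0-based when it is absent. Only the gmqlSchema tag and the coordinateSystem attribute come from this issue; the rest of the XML layout (collection element, field names and types) and the function name are assumptions made for the example and may not match the real schema files.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema for a user-defined tab-delimited dataset.
SCHEMA_XML = """<gmqlSchemaCollection name="MY_DATASET">
  <gmqlSchema type="tab" coordinateSystem="1-based">
    <field type="STRING">chr</field>
    <field type="LONG">start</field>
    <field type="LONG">stop</field>
    <field type="CHAR">strand</field>
  </gmqlSchema>
</gmqlSchemaCollection>"""


def read_coordinate_system(schema_xml: str) -> str:
    """Return the coordinateSystem of the first gmqlSchema element.

    Falls back to "0-based" when the attribute is missing, as proposed
    at the top of this issue.
    """
    root = ET.fromstring(schema_xml)
    for element in root.iter():
        # Strip an XML namespace prefix such as "{uri}" if one is present.
        if element.tag.rsplit("}", 1)[-1] == "gmqlSchema":
            value = element.get("coordinateSystem", "0-based")
            if value not in ("0-based", "1-based"):
                raise ValueError(f"invalid coordinateSystem: {value!r}")
            return value
    raise ValueError("no gmqlSchema element found")
```

With the sample above the function returns the declared value; removing the attribute from the gmqlSchema tag makes it return the 0-based default instead.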

acanakoglu commented 7 years ago

@OlgaGorlova, there was a bug in the schema creation; I fixed it and committed the fix. However, there is another problem to correct.

In the output, you are not passing the coordinate system parameter to the CLI? And also, are you setting the default type in the output for TAB and GTF as you mentioned above?

> You do not have to specify it - GTF and VCF formats by default are read as 1-based, others as 0-based

If you need more explanation, please let me know.

OlgaGorlova commented 7 years ago

@acanakoglu, Thank you!

> In the output, you are not passing the coordinate system parameter to the CLI?

Yes, I committed changes to https://github.com/DEIB-GECO/GMQL/tree/Coordinate_System . Could you please try it?

> And also, are you setting the default type in the output for TAB and GTF as you mentioned above?

Yes, if you do not specify the coordinate system, it will use the default type.

acanakoglu commented 7 years ago

I tried it, and everything works in my tests. You can merge it into the main branch.