ashishbaghudana / mthesis-ashish

MIT License

GNormPlus <-> TagTog #27

Closed ashishbaghudana closed 8 years ago

ashishbaghudana commented 8 years ago

The purple-highlighted ones are transcription factors. The green-highlighted ones are general proteins/genes.

Does this seem okay?

sample1

sample2

juanmirocks commented 8 years ago

Seems about right, but:

juanmirocks commented 8 years ago

But:

ashishbaghudana commented 8 years ago

https://github.com/ashishbaghudana/PubTator2Anndoc/

You can use this script to convert PubTator format to Anndoc. I've documented the module fairly well, so it should be easy to use.

ashishbaghudana commented 8 years ago

https://pypi.python.org/pypi/PubTator2Anndoc/0.1.0

Also updated on PyPI.

ashishbaghudana commented 8 years ago

A small bug on TagTog, possibly because the text is HTML-based. If a sentence contains double spaces, for instance "ChIP analysis revealed that the CLOCK  BMAL1  CRY1 complex strongly occupies the promoter region of Gm129", HTML rendering automatically collapses multiple spaces into a single space. I guess the best way to overcome this would be to convert all spaces to &nbsp;

I didn't think of this when I wrote the conversion script, since I hadn't expected cases with double spacing, but it messes up the annotations in the JSON.

juanmirocks commented 8 years ago

@ashishbaghudana awesome for the public python package! :)

As for the possible bug: the current behavior was well thought out and is intentional. There may certainly be a use case for preserving these spaces, but I cannot review this in tagtog right now. I guess you could add an option to your own converter to either 1) preserve spaces (the current behavior, ignoring possible misalignments) or 2) delete extra spaces and work out the correct alignments. I think it should be doable.

Note that, as of now, any sequence of whitespace characters (including tabs) is replaced by a single space.
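
For option 2, a minimal sketch of how a converter might collapse whitespace runs and shift annotation offsets to match. This is not taken from PubTator2Anndoc or tagtog: the function names `collapse_whitespace` and `remap_span` are hypothetical, and the set of characters treated as whitespace (space, tab, newline, carriage return, form feed) is an assumption about what gets collapsed.

```python
# Assumption: these are the characters that get collapsed into one space.
WHITESPACE = set(" \t\n\r\f")


def collapse_whitespace(text):
    """Replace each whitespace run with one space.

    Returns (new_text, offset_map), where offset_map[i] is the index in
    new_text that corresponds to index i in the original text.
    offset_map has len(text) + 1 entries so exclusive end offsets work too.
    """
    out = []
    offset_map = []
    in_run = False
    for ch in text:
        offset_map.append(len(out))
        if ch in WHITESPACE:
            if not in_run:
                out.append(" ")  # keep only the first character of a run
            in_run = True
        else:
            out.append(ch)
            in_run = False
    offset_map.append(len(out))
    return "".join(out), offset_map


def remap_span(start, end, offset_map):
    """Translate a (start, end) span, end exclusive, to the collapsed text."""
    return offset_map[start], offset_map[end]


original = "CLOCK  BMAL1  CRY1"
collapsed, offsets = collapse_whitespace(original)
# "BMAL1" sits at [7, 12) in the original text.
new_start, new_end = remap_span(7, 12, offsets)
print(collapsed[new_start:new_end])
```

The offset map is built in the same pass as the collapsed text, so every annotation can be remapped with two list lookups instead of re-searching for the entity string.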