implement GFF3 in python-apollo

nathandunn commented 4 years ago

Implement like https://github.com/galaxy-genome-annotation/python-apollo/pull/25

[ ] propagate test_suite so that all methods in load_gff3 honor the test flag:
- [ ] add_attribute
- [ ] set_translation_start/end
- [ ] set_name
- [ ] scan others
[ ] add tests that reflect the added bootstrap commands for test: 0 annotations for
[ ] add tests that reflect the added bootstrap commands for disable_cds . . etc: identical to what was there before I guess?
[ ] add tests that reflect the added bootstrap commands for use_name: identical to what was there before I guess
[ ] add to test suite
[x] test scripts to run
- export ARROW_GLOBAL_CONFIG_PATH=/Users/nathandunn/repositories/python-apollo/test-data/local-arrow.yml and ./bootstrap_apollo.sh --nodocker
[x] integrate with the arrow command
[x] implement all of these features within the existing script
- [x] test
- [x] disable_cds_recalculation
- [x] use_name_for_feature
- [x] review other features

===

[x] integrate the python script into the apollo deploy script
[ ] write a new script to use the libraries to do this
[ ] update requirements
[ ] integrate into the tests

nathandunn commented 4 years ago

Current features:

GetOptions("input|i=s"                   => \$input_file,
           "username|u=s"                => \$username,
           "password|p=s"                => \$password,
           "url|U=s"                     => \$url,
           "gene_types_in|g=s"           => \$gene_types_in,
           "pseudogene_type_in|n=s"      => \$pseudogene_types_in,
           "transcript_types_in|t=s"     => \$transcript_types_in,
           "exon_types_in|e=s"           => \$exon_types_in,
           "organism|o=s"                => \$organism,
           "cds_types_in|d=s"            => \$cds_types_in,
           "ontology|O=s"                => \$ontology,
           "gene_type_out|G=s"           => \$gene_type_out,
           "pseudogene_type_out|N=s"     => \$pseudogene_type_out,
           "mrna_type_out|M=s"           => \$mrna_type_out,
           "transcript_type_out|T=s"     => \$transcript_type_out,
           "exon_type_out|E=s"           => \$exon_type_out,
           "cds_type_out|D=s"            => \$cds_type_out,
           "property_ontolgy|R=s"        => \$property_ontology,
           "comment_type_out|C=s"        => \$comment_type_out,
           "property_type_out|S=s"       => \$property_type_out,
           "track_prefix|P=s"            => \$annotation_track_prefix,
           "disable_cds_recalculation|X" => \$disable_cds_recalculation,
           "success_log|l=s"             => \$success_log_file,
           "error_log|L=s"               => \$error_log_file,
           "skip|s=s"                    => \$skip_file,
           "test|x"                      => \$test,
           "help|h"                      => \$help,
           "name_attributes=s"           => \$name_attributes,
           "use_name_for_feature|a"      => \$use_name_for_feature);

nathandunn commented 4 years ago

https://python-apollo.readthedocs.io/en/latest/commands/annotations.html#load-gff3-command

nathandunn commented 4 years ago

Will just need to do a few of the options:

--test (easy enough), --disable_cds_recalculation, and --use_name_for_feature
lol
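
A minimal sketch of how those three flags might look on the Python side. It uses the standard-library argparse only to stay self-contained (the real arrow command is wired up differently), and build_parser plus the --input/--organism arguments are hypothetical names for illustration. The short flags mirror the Perl GetOptions spec above.

```python
import argparse


def build_parser():
    """Build a parser exposing only the three Perl options worth porting."""
    parser = argparse.ArgumentParser(
        description="Load GFF3 features (illustrative sketch, not the arrow CLI)"
    )
    # Hypothetical positional context: which file, which organism.
    parser.add_argument("--input", "-i", required=True, help="GFF3 file to load")
    parser.add_argument("--organism", "-o", required=True, help="target organism")
    # The three boolean switches carried over from the Perl script.
    parser.add_argument("--test", "-x", action="store_true",
                        help="parse and report, but write nothing")
    parser.add_argument("--disable_cds_recalculation", "-X", action="store_true",
                        help="keep the CDS as given in the GFF3")
    parser.add_argument("--use_name_for_feature", "-a", action="store_true",
                        help="use the Name attribute as the feature name")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

All three switches default to off, so omitting them leaves the loader's normal behaviour unchanged.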

nathandunn commented 4 years ago

Looking more closely, I think the Python and original Perl scripts are pretty different, though they share some commonalities (looking here: https://github.com/galaxy-genome-annotation/python-apollo/pull/25)

I think it would make more sense to do this with a clean slate for a newer apollo developed around the OGS calculations.

The fundamental difference is that the Perl file accumulates the JSON and then sends (or not) the accumulated features all at once, processing features, transcripts, and variants separately. The Python script is more focused on specific use cases and only works at the feature level (which is probably sufficient), issuing writes to adjust names and attributes as it goes.
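
To make the "accumulate, then send (or not)" style concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the record and payload layouts, the function names, and the send callable, which stands in for whatever API call would actually submit the features.

```python
def accumulate_features(records, use_name=False):
    """Collect every record into one payload instead of writing one at a time."""
    features = []
    for rec in records:
        # Hypothetical feature layout; the real Apollo JSON differs.
        feature = {
            "type": rec["type"],
            "start": rec["start"],
            "end": rec["end"],
            "strand": rec["strand"],
        }
        if use_name and rec.get("name"):
            feature["name"] = rec["name"]
        features.append(feature)
    return {"features": features}


def load(records, send, test=False, use_name=False):
    """Send the accumulated payload in one call; in test mode, send nothing."""
    payload = accumulate_features(records, use_name=use_name)
    if test:
        return 0  # nothing written
    send(payload)
    return len(payload["features"])
```

With this shape the test flag only has to be checked in one place, rather than in every method that writes a name or attribute.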

hexylena commented 4 years ago

If we could swap the python to use a more normal way of doing things, accumulating + sending, that's fine for me! I believe the 'write to adjust the names' was a bug/oddity workaround ;)

The only concern I'd have is that it's either 100% success or 0%. As long as you run the creation in a database transaction, and roll back in case of an error creating one of the features, we'd be happy to use a bulk API on the python side.
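
The all-or-nothing behaviour I mean looks roughly like this. It's a sketch using Python's built-in sqlite3 purely to show the semantics; the feature table and bulk_insert are made up for the example, and the real thing would live in Apollo's own persistence layer.

```python
import sqlite3


def bulk_insert(conn, features):
    """Insert every feature or none of them."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.executemany(
                "INSERT INTO feature (name, type) VALUES (?, ?)", features
            )
    except sqlite3.Error:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (name TEXT NOT NULL UNIQUE, type TEXT NOT NULL)")

ok = bulk_insert(conn, [("g1", "gene"), ("g2", "gene")])
# The second batch fails on the duplicate "g1", so "g3" must not be kept either.
failed = bulk_insert(conn, [("g3", "gene"), ("g1", "gene")])
count = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
```

After the failed second batch the table still holds only the two rows from the first one.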

But of course clean slate sounds fine too, let's just identify the best implementation and implement on both sides?

(not that I have time to work on this while in sabbatical)

nathandunn commented 4 years ago

@hexylena I'm happy to keep working on it while you're on sabbatical:

> The only concern I'd have is that it's either 100% success or 0%. As long as you run the creation in a database transaction, and roll back in case of an error creating one of the features, we'd be happy to use a bulk API on the python side.

I agree that makes sense.

My only thought is that doing this using the current API will work for reasonably small numbers of features (<10K). For more than that, regardless of the backend, I should open up an API that writes directly to SQL, as doing this via Hibernate is going to be painful.

So we would need to:

[ ] emulate the Perl that is already written (with far fewer options), but using the current Python framework in place
[ ] write a bulk loading API on the Apollo side that writes SQL directly

nathandunn commented 4 years ago

Let me know what you think.

nathandunn commented 4 years ago

FYI https://github.com/abretaud/migrate_apollo_db/

nathandunn commented 4 years ago

https://github.com/GMOD/Apollo/issues/2408

GMOD / Apollo

implement GFF3 in python-apollo #2410