danmaclean / gee_fu

An extensible Ruby on Rails web-service application and database for visualising HTGS data

GFF data for features #36

Closed mrship closed 11 years ago

mrship commented 11 years ago

@danmaclean In exporting the feature dataset with #to_gff I've come across Features that don't have a reference_id. That causes the code to blow up. Any thoughts as to why some features don't have reference_ids?

I'm wary of digging into this code too much as I don't really know what it does and there are no tests for it, so if we can work out a quick fix I can release a feature to write GFF data and complete the repository dump.

Let me know.

mrship commented 11 years ago

Also, for predecessor #to_gff I'm seeing empty strings returned - is that to be expected?

danmaclean commented 11 years ago

@mrship They should all have reference ids. For each feature, a reference is found first: https://gist.github.com/danmaclean/5437571 This could return nil (perhaps if @experiment.genome_id is a string; playing on the console showed me I needed to use an int for reference_id in Feature.find(:all).select {|f| f.reference_id == 1}).
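A minimal sketch of the string-versus-integer point, in plain Ruby so it runs without the app. `FeatureRow` and the sample ids are hypothetical stand-ins, not the real Feature model; the only claim is that an integer id never equals its string form, so a string genome id would silently match nothing.

```ruby
# Hypothetical stand-in for a Feature row; not the app's ActiveRecord model.
FeatureRow = Struct.new(:id, :reference_id)

rows = [FeatureRow.new(1, 1), FeatureRow.new(2, 1), FeatureRow.new(3, nil)]

# Comparing an integer column against a string selects nothing.
by_string = rows.select { |f| f.reference_id == "1" }

# Coercing the incoming value first gives the intended matches.
genome_id = "1" # assumed to arrive as a string, e.g. from request params
by_integer = rows.select { |f| f.reference_id == Integer(genome_id) }

# Rows with no reference at all: the ones that would break #to_gff.
orphans = rows.select { |f| f.reference_id.nil? }
```

Listing the `orphans` first is probably the quickest way to see whether the problem is in the data or in the lookup.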

This is added as an attribute and saved back, so presuming it found the proper reference then we should be ok.

Reasons it might not have found the proper reference:

  1. Reference.find returning nil and code not checking...
  2. Bio::GFF::GFF3 parse error such that record.seqname isn't == the name in reference. I wonder if Bio::GFF::GFF3 doesn't carp when passed an empty string or strings with just "\n". I may need to tighten up the loading code.
  3. Earlier DB load errors - on my test site every feature has a proper reference_id when I checked the DB. So it seems to be going in fine. But if your GFF file has loads of empty lines that aren't being caught, then maybe that would mess it up
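For reason 1, one option is to fail loudly at load time instead of saving a feature with no reference. The sketch below is an assumption about how that could look, not the loading code from the gist: `REFERENCE_IDS` is a hypothetical hash standing in for the Reference lookup, and `reference_id_for` is an illustrative helper name.

```ruby
# Hypothetical lookup table standing in for the Reference query.
REFERENCE_IDS = { "Chr1" => 1, "Chr2" => 2 }.freeze

# Return the reference id for a GFF seqname, or raise if there is none,
# so a bad record stops the load instead of being saved without a reference.
def reference_id_for(seqname)
  REFERENCE_IDS.fetch(seqname) do
    raise ArgumentError, "no reference named #{seqname.inspect}"
  end
end
```

With a guard like this, an empty seqname from a blank line would raise at import rather than surface later in #to_gff.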

This gist shows the part of experiments_controller.rb responsible for reading the uploaded GFF file: https://gist.github.com/danmaclean/5437636 It isn't removing trailing newlines (Bio::Record::GFF::GFF3 should do this, but it isn't skipping newlines, and perhaps everything is going through silently).

Looks like this last one is a good bet. This code gives a completely empty string object, rather than just dying:

g = Bio::GFF::GFF3.new("\n")
 => ##gff-version 3
 .    . .   .   .   .   .   .   .   .

I'm supposing that this will boil down to when we add data, rather than retrieval of data by Rails itself!
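If the fix does belong at load time, a rough sketch of filtering the uploaded text before any record objects are built might look like this. It is plain Ruby and does not touch the Bio::GFF classes; `feature_lines` is a hypothetical helper, and treating anything without nine tab-separated columns as junk is an assumption based on the GFF3 layout.

```ruby
# Keep only lines that can describe a feature.
def feature_lines(gff_text)
  gff_text.each_line.map(&:chomp).select do |line|
    next false if line.strip.empty?      # blank or whitespace-only lines
    next false if line.start_with?("#")  # comments and ## directives
    line.split("\t", -1).size == 9       # GFF3 rows have nine columns
  end
end

sample = "##gff-version 3\nChr1\tTAIR9\tgene\t1\t10\t.\t+\t.\tID=g1\n\n\n"
kept = feature_lines(sample)
```

Only the surviving lines would then be handed to the record constructor, so a trailing "\n" never becomes an all-dots record.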

Predecessor to_gff shouldn't return empty strings, but I bet this is a follow on from the empty gff. Do any of your gffs have empty lines at the bottom?

danmaclean commented 11 years ago

@mrship Actually, in the loading of the GFF each line from the GFF file should be a Bio::GFF::GFF3::Record, not a Bio::GFF::GFF3. These two are similar and give nearly the same results in this test, but may be messing up elsewhere.

> g = Bio::GFF::GFF3::Record.new("\n")
 => .   .   .   .   .   .   .   .   .

mrship commented 11 years ago

OK, I'm going to look at wrapping some tests around the code so we can get to the bottom of it. It may be due to my testing with rogue data, but until we can definitively (and easily) test the code it will be difficult to see exactly where the problems lie. I'll crack on with that tomorrow.

mrship commented 11 years ago

OK, having had a head-scratching morning as I work through how Features are created, I have reverted to a simple test for a Feature where I import the test FNA and GFF and look at the output from #to_gff.

Under the old method, I get:

Chr1    TAIR9   three_prime_UTR 11649   11863   .   -   .   Parent=AT1G01030.1

Under your revised method, I get:

Chr1    TAIR9   three_prime_UTR 11649   11863   0.0 -   0   Parent=AT1G01030.1;gfu_id=60

  1. Are the differences significant?
  2. Where else might these differences catch us out?
  3. Should we let sleeping dogs lie for now (i.e. work with the existing method) until we have a better test suite?
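To pin down exactly where the two outputs differ, a small column-by-column comparison could go into a spec. This is only a sketch: `differing_columns` is a hypothetical helper, the column names follow the usual GFF3 ordering, and the lines are assumed to be tab-separated as in a real GFF file.

```ruby
GFF_COLUMNS = %i[seqid source type start end score strand phase attributes].freeze

# Return the names of the columns whose values differ between two GFF lines.
def differing_columns(line_a, line_b)
  a = line_a.chomp.split("\t")
  b = line_b.chomp.split("\t")
  GFF_COLUMNS.each_index.reject { |i| a[i] == b[i] }.map { |i| GFF_COLUMNS[i] }
end

old_line = "Chr1\tTAIR9\tthree_prime_UTR\t11649\t11863\t.\t-\t.\tParent=AT1G01030.1"
new_line = "Chr1\tTAIR9\tthree_prime_UTR\t11649\t11863\t0.0\t-\t0\tParent=AT1G01030.1;gfu_id=60"

changed = differing_columns(old_line, new_line)
# changed == [:score, :phase, :attributes]
```

For the pair above, only score, phase and the attributes column change; the coordinates and strand are identical.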

I'll continue to try and wrap some tests around the logic and wrap my head around it too!

danmaclean commented 11 years ago

GFF is a swear word in bioinformatics sometimes...

  1. No. They stem from differences in the interpretation of the GFF format between me and the author of Bio::GFF::GFF3::Record. GFF is a mess of a format and the new method appears to stick to the text of the format description, but not the examples or common practice. The gfu_id is just a bolt-on I put in there because I thought it would be useful; it can be added later.
  2. They shouldn't. You wouldn't really use these in the app. I can't see us ever needing to refer to them other than in tests.
  3. OK. The original method looks more like GFF3 and makes more sense to a reader (and looks like the input GFF). We can move to the 'official' BioRuby GFF3 later.

mrship commented 11 years ago

OK, I'll leave well alone for now then :smile:

I've got a working version of some very simple specs that have helped me to determine the problems with the rake repo:export task in outputting the GFF. I'll create a PR that reflects those changes.