UCLOrengoGroup / cath-todo

Issues relating to the CATH protein structure classification web pages
4 stars 0 forks source link

Discrepancy of versions in CATH-B? #16

Closed sillitoe closed 7 years ago

sillitoe commented 7 years ago

The latest release files mention "v4_0_0" (need to investigated why it's not "v4_1_0")

http://download.cathdb.info/cath/releases/daily-release/newest/

tonyelewis commented 7 years ago

Ah yes. I should have done that and documented it as part of the release process.

Let me have a look at this now. I'll aim to do a simple "hand-edit this file" version that you can then improve as you see fit.

sillitoe commented 7 years ago

Thanks, this was meant for my own docs (I should've self assigned).

Great if you're okay to look into it.

tonyelewis commented 7 years ago

Progress update:

The relevant crontab entry is at line 31 of puppet/trunk/files/crontabs/update/cathupdate.

That's running update/trunk/database/cath_b.sql on the database. I think the relevant parts of that are:

  CASE WHEN dh.domain_id IS NULL THEN 'v4_0_0' ELSE 'putative' END,

...

          event_timestamp   >= '2013-05-10'
tonyelewis commented 7 years ago

Right. I've:

I think it's worth leaving this ticket open for a day to check the overnight cath_b dump is as expected.

Further improvements you could consider:

sillitoe commented 7 years ago

Hmmm... would be good to set version as ENV from crontab, but not if we stick to raw SQL.

Maybe pass the SQL through something that changes "VERSION" to "v4_1_0" before it all gets piped into psql?

sillitoe commented 7 years ago

(sorry, didn't see your most recent comment when I posted that last post)

sillitoe commented 7 years ago

implement that via a releases table in the DB (a table we've discussed many times 😐)

this would be my preferred option.

Thanks for sorting - I'll close when the file looks okay tomorrow.

tonyelewis commented 7 years ago

(sorry, didn't see your most recent comment when I posted that last post)

NP. Indeed, taking the version & date out of the SQL into the crontab would be a big step in the right direction and would probably be much quicker than adding new DB tables etc.

It seems you can do this sort of thing:

> cat /tmp/bob.sql
SELECT COUNT(*) FROM domain WHERE flow_stage_type = :dom_fst;
> psql -A -t -F ' ' cathdb_current --variable=dom_fst="'ASSIGNED'" --file /tmp/bob.sql
421066
> psql -A -t -F ' ' cathdb_current --variable=dom_fst="'HOLDING_PEN'" --file /tmp/bob.sql
22211

(where the -A -t -F ' ' is just to simplify the output to a single number)

Nice :sunglasses: - I think that could be all you need here.

tonyelewis commented 7 years ago

This is looking sensible to me...

Numbers of lines in files from the night before last:

420420 cath_b.20161130.all
235310 cath_b.20161130.latest_release
185110 cath_b.20161130.putative

...and from last night:

421494 cath_b.20161201.all
308898 cath_b.20161201.latest_release
112596 cath_b.20161201.putative

To me, that looks like a sensible shift of numbers from putative to latest_release. Also, the version in the files seem sensible:

cath_b.20161130.all:101mA00 v4_0_0 1.10.490.10 0-153:A
cath_b.20161130.all:102lA00 v4_0_0 1.10.530.40 1-162:A
cath_b.20161130.all:102mA00 v4_0_0 1.10.490.10 0-153:A
cath_b.20161130.latest_release:101mA00 v4_0_0 1.10.490.10 0-153:A
cath_b.20161130.latest_release:102lA00 v4_0_0 1.10.530.40 1-162:A
cath_b.20161130.latest_release:102mA00 v4_0_0 1.10.490.10 0-153:A
cath_b.20161201.all:101mA00 v4_1_0 1.10.490.10 0-153:A
cath_b.20161201.all:102lA00 v4_1_0 1.10.530.40 1-162:A
cath_b.20161201.all:102mA00 v4_1_0 1.10.490.10 0-153:A
cath_b.20161201.latest_release:101mA00 v4_1_0 1.10.490.10 0-153:A
cath_b.20161201.latest_release:102lA00 v4_1_0 1.10.530.40 1-162:A
cath_b.20161201.latest_release:102mA00 v4_1_0 1.10.490.10 0-153:A

@sillitoe : :nose: right to you?

sillitoe commented 7 years ago

Looks great, many thanks

tonyelewis commented 7 years ago

Ta.

Just to check: did you notice this previous comment? (Don't want to sound like a high-maintenance attention-seeker; just want to check that something potentially time-savingly useful hasn't fallen through the cracks.)

sillitoe commented 7 years ago

Good point, thanks (you high-maintenance attention seeker).

Will close when that's sorted.

sillitoe commented 7 years ago

@tonyelewis was there a reason you chose:

event_timestamp >= '2015-04-01'

...for the release date?

I've added a release table that stores freeze_timestamp and release_timestamp - essentially latest will be the most recent entry with a non-null release_timestamp (so we can add a frozen release without it moving to latest).

I've adjusted the SQL and it seems to work, but there are difference due to our official "release date" of v4.1. I had July this year - I would happily be told it was earlier but I'm not sure about 3 months after we froze.

??

tonyelewis commented 7 years ago

@tonyelewis was there a reason[...]?

@sillitoe Yes. But contrary to the hopes of Boyzone/The Osmonds, the reason was not love. Instead, it was that when I ran the SQL that I put in the a brief note I added to the "Release Protocol" Trac page (mentioned above), the numbers of domain assignments for the months in the relevant period looked like this:

[...]
 2016-07-01 |  4687
 2016-06-01 |  2310
 2016-05-01 |   881
 2016-04-01 |   741
 2016-03-01 | 48623
 2016-02-01 |     0
 2016-01-01 |  3146
 2015-12-01 | 27888
 2015-11-01 |  4558
 2015-10-01 |  1207
 2015-09-01 |  3308
 2015-08-01 | 10043
 2015-07-01 |  1485
 2015-06-01 |    92
 2015-05-01 |     0
 2015-04-01 |     0
 2015-03-01 |     0
 2015-02-01 |     0
 2015-01-01 |  5700
 2014-12-01 |   730
[...]

...so I went for somewhere in the middle of that patch of zeroes.

2015-01-01 wouldn't do because it would exclude all the v4_1_0 domains that were assigned whilst we let the frozen PDB flush through the system.

Anytime after 2015-06-01 wouldn't do because then it would include all domains were assigned to cathdb_current after we'd branched off the release.

Does that :nose: right to you?

sillitoe commented 7 years ago

Hmm okay, so we're saying apr is the freeze date (rather than the release date)?

I was going to use the release date to give the official "latest" release.

I suppose I can select "latest" on the release date, but then use the freeze date (Apr 2015) for the calculation of CATH-B...

tonyelewis commented 7 years ago

FWIW, I think there are three different dates worth recording here. I agree with what you say about release date. I also think it's worth recording the date of the PDB snapshot on which the CATH release is built (eg, here: 2015-01-01).

Maybe something like (in chronological order):

?

sillitoe commented 7 years ago

any chance you could rerun that query and let me know the pdb_snapshot_date and assignment_cutoff_date of any of the releases before v4.1?

tonyelewis commented 7 years ago

Yep. NP. I can do assignment_cutoff_dates back to v3.0:

v4.1.0 2015-04-01
v4.0.0 2013-07-01
v3.5.0 2011-10-01
v3.4.0 2011-01-01
v3.3.0 2009-06-01
v3.2.0 2008-07-01
v3.1.0 2007-02-01
v3.0.0 2006-05-01

I'm not sure this data helps with pdb_snapshot_date but then you've got those (going back to v3.0) on here: http://www.cathdb.info/wiki/doku/?id=release_notes


Full data for reference...

(note v3.2.0 isn't against a 0 but I checked that one at the week level - 2008-07-01 is OK)

 2016-12-01 |   501
 2016-11-01 |  3327
 2016-10-01 |  3518
 2016-09-01 | 12733
 2016-08-01 |     0
 2016-07-01 |  4687
 2016-06-01 |  2310
 2016-05-01 |   881
 2016-04-01 |   741
 2016-03-01 | 48623
 2016-02-01 |     0
 2016-01-01 |  3146
 2015-12-01 | 27888
 2015-11-01 |  4558
 2015-10-01 |  1207
 2015-09-01 |  3308
 2015-08-01 | 10043
 2015-07-01 |  1485
 2015-06-01 |    92
 2015-05-01 |     0
 2015-04-01 |     0     v4.1.0
 2015-03-01 |     0
 2015-02-01 |     0
 2015-01-01 |  5700
 2014-12-01 |   730
 2014-11-01 |  3153
 2014-10-01 |  1753
 2014-09-01 |  3120
 2014-08-01 |  2369
 2014-07-01 |  1375
 2014-06-01 |  1355
 2014-05-01 |  3473
 2014-04-01 |  6926
 2014-03-01 |  7242
 2014-02-01 |  3642
 2014-01-01 |  1328
 2013-12-01 |  1876
 2013-11-01 |  6121
 2013-10-01 | 23470
 2013-09-01 |     0
 2013-08-01 |     0
 2013-07-01 |     0     v4.0.0
 2013-06-01 |     1
 2013-05-01 |   846
 2013-04-01 | 13009
 2013-03-01 |  1147
 2013-02-01 | 13981
 2013-01-01 |     0
 2012-12-01 |     0
 2012-11-01 | 17266
 2012-10-01 |   165
 2012-09-01 |   251
 2012-08-01 |   220
 2012-07-01 | 15017
 2012-06-01 |    67
 2012-05-01 |     0
 2012-04-01 |    22
 2012-03-01 |    39
 2012-02-01 |   142
 2012-01-01 |    42
 2011-12-01 |    79
 2011-11-01 |    43
 2011-10-01 |     0     v3.5.0
 2011-09-01 |     0
 2011-08-01 |  2205
 2011-07-01 |  1770
 2011-06-01 |  1025
 2011-05-01 |  1514
 2011-04-01 |  1605
 2011-03-01 |  9372
 2011-02-01 |  3162
 2011-01-01 |     0     v3.4.0
 2010-12-01 |     0
 2010-11-01 |     0
 2010-10-01 |   450
 2010-09-01 |  1616
 2010-08-01 |  2273
 2010-07-01 |  3026
 2010-06-01 |  1823
 2010-05-01 |  2211
 2010-04-01 |  1668
 2010-03-01 |  2217
 2010-02-01 |  2021
 2010-01-01 |  1300
 2009-12-01 |  2098
 2009-11-01 |   688
 2009-10-01 |   552
 2009-09-01 |  1561
 2009-08-01 |   833
 2009-07-01 |    30
 2009-06-01 |     0     v3.3.0
 2009-05-01 |   249
 2009-04-01 |   988
 2009-03-01 |  3030
 2009-02-01 |   360
 2009-01-01 |  1396
 2008-12-01 |  1955
 2008-11-01 |  1176
 2008-10-01 |  1022
 2008-09-01 |  1013
 2008-08-01 |  1548
 2008-07-01 |  1728     v3.2.0
 2008-06-01 |     2
 2008-05-01 |   750
 2008-04-01 |  1562
 2008-03-01 |  1136
 2008-02-01 |  1755
 2008-01-01 |   575
 2007-12-01 |  1585
 2007-11-01 |  3011
 2007-10-01 |     0
 2007-09-01 |   244
 2007-08-01 |  7680
 2007-07-01 |  1771
 2007-06-01 |   297
 2007-05-01 |    12
 2007-04-01 |     0
 2007-03-01 |     0
 2007-02-01 |     0     v3.1.0
 2007-01-01 |  4236
 2006-12-01 |  2487
 2006-11-01 |  5934
 2006-10-01 |     0
 2006-09-01 |     0
 2006-08-01 |   188
 2006-07-01 |     0
 2006-06-01 |     0
 2006-05-01 |     0     v3.0.0
 2006-04-01 | 16568
 2006-03-01 |     0
 2006-02-01 |     0
 2006-01-01 |     0
sillitoe commented 7 years ago

Actually don't worry too much about this. Based on the v4,0 milestone

e.g.

The CATH Update system may flush/cluster itself before 10th May so we may be able to pull this schedule forward a little.

I'm calling the assignment_cutoff_date of v4.0 2015-05-10.

Any dates before that might be handy, but far from essential.

tonyelewis commented 7 years ago

I'm calling the assignment_cutoff_date of v4.0 2015-05-10.

Do you mean 2013? It looks like there were a few more assignments after that, eg 47 in the week of 2013-05-27. Maybe due to running behind schedule? Or late fixes?

Anyway, I'd recommend the dates in my previous comment.

sillitoe commented 7 years ago

Hmmm, currently that's giving me the following table...

INSERT INTO release 
  ( cath_version, pdb_snapshot_date, assignment_cutoff_date, release_date ) VALUES
  ( 'v2.6.0', NULL,         NULL,         '2005-03-16' ),
  ( 'v3.0.0', '2006-05-05', '2006-05-01', '2006-05-15' ),
  ( 'v3.1.0', '2007-01-14', '2007-02-01', '2007-01-14' ),
  ( 'v3.2.0', '2008-07-14', '2008-07-01', '2008-07-14' ),
  ( 'v3.3.0', '2009-07-07', '2009-06-01', '2009-07-07' ),
  ( 'v3.4.0', '2010-11-13', '2011-01-01', '2011-01-01' ),
  ( 'v3.5.0', '2011-09-20', '2011-10-01', '2011-10-14' ),
  ( 'v4.0.0', '2013-03-26', '2015-05-10', '2013-07-01' ),
  ( 'v4.1.0', '2015-01-01', '2015-04-01', '2016-07-01' )
  ;

I think some of the old "PDB release dates" may actually have got mixed up with web site release date. I'm not going to spend too much more time sorting old dates for the archive - more important that we use this moving forward.

The only entry I'll need to adjust is:

  ( 'v3.1.0', '2007-01-14', '2007-02-01', '2007-01-14' ),

to:

  ( 'v3.1.0', '2007-01-14', '2007-01-04', '2007-01-14' ),

??

sillitoe commented 7 years ago

I've updated the SQL to use this table and checked that the results match exactly with the SQL in svn source.

SQL could probably be neater, but it works.

http://trac.cathdb.info/projects/cath/changeset/18617

Closing the ticket - thanks @tonyelewis

tonyelewis commented 7 years ago

Cracking.

Are the last two dates (in different years) for v4.0.0 out of order?

Anyway, great work.

Edit: correction v4.1.0 -> v4.0.0

sillitoe commented 7 years ago

No, just wrong - but already fixed. Thanks

On Fri, 2 Dec 2016, 15:48 Tony E Lewis, notifications@github.com wrote:

Cracking.

Are the last two dates (in different years) for v4.1.0 out of order?

Anyway, great work.

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/UCLOrengoGroup/cath-todo/issues/16#issuecomment-264485416, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJVeld0kAfWzJ2SrZLkoNaTNiaVJQz6ks5rED2wgaJpZM4K__UC .