OCHA-DAP / hdx-ckan

A repo for HDX's configurations and extensions to CKAN

Garbage in ckan db. #666

Open teodorescuserban opened 10 years ago

teodorescuserban commented 10 years ago

Working on caching, I listed on stag the CPS resources published on CKAN and noticed that quite a few CKAN resources point to the old CPS URLs (ones containing hdx-1.0.0 instead of hdx).

Please PM me on Skype for more details.

cjhendrix commented 10 years ago

As I understand it, Luis/Godfrey are working on these? Is there a data team issue for it? If so, can we close this one?

teodorescuserban commented 10 years ago

@luiscape please comment on it.

I would leave it open, unless there is a specific repo for data team issues. :)

luiscape commented 10 years ago

@teodorescuserban Interesting. Do you know what the process is for registering resources in CKAN? Is it done at the CPS level, or is there another script running on its own somewhere else?

teodorescuserban commented 10 years ago

There were 2 ways to input data into prod ckan so far.

  1. through api:
    • ckan setup script made by @aalecs, run several times (the resources I mentioned were most likely created by this script)
    • David got the airports locations
  2. through normal interface
    • you guys were supposed to add / validate / delete some datasets - at least @cjhendrix and @amcguire62 seem to think so.

As far as I know, there is not yet a programmed way to publish CPS resources to CKAN.

cjhendrix commented 10 years ago

That is my understanding as well.

takavarasha commented 10 years ago

Talked to Luis about this. We did a CKAN API search on stag and prod and did not find any datasets whose URL contains "hdx-".

luiscape commented 10 years ago

Just complementing Godfrey's response and closing.

We searched using http://data.hdx.rwlabs.org/api/action/resource_search?query=url:hdx-1.0.0. The output is:

result: {
count: 0,
results: [ ]
}

When we search with http://data.hdx.rwlabs.org/api/action/resource_search?query=url:hdx the output is:

result: {
count: 2713,
results: [ ]
}

Even searching with Ctrl+F in the output of the latter query, no hdx-1.0.0 was found.

Closing.
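
For anyone repeating this check, here is a minimal Python sketch of the same two steps: build the resource_search URL, then look for the old path among the returned resources (the Ctrl+F step). The host is the one quoted in this thread and may no longer resolve. The function names are made up for illustration, and the response shape (a "result" object with "count" and "results") is assumed from the output pasted above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host as quoted in the thread; it may no longer resolve.
BASE_URL = "http://data.hdx.rwlabs.org"


def build_search_url(term, field="url", base_url=BASE_URL):
    """Build a resource_search URL for a `field:term` query."""
    query = urlencode({"query": f"{field}:{term}"})
    return f"{base_url}/api/action/resource_search?{query}"


def summarize(payload, needle):
    """Return (count reported by the API, returned resources whose url contains needle)."""
    result = payload["result"]
    hits = [r for r in result["results"] if needle in (r.get("url") or "")]
    return result["count"], len(hits)


def search(term, needle):
    """Run the query against the live API (needs network access)."""
    with urlopen(build_search_url(term)) as response:
        return summarize(json.load(response), needle)
```

Calling search("hdx", "hdx-1.0.0") would correspond to the second query plus the manual text search. The check only inspects what the API returns, so it says nothing about rows that exist in the database but are not exposed, and if the API caps the number of results per call, the substring check covers only the returned page.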

cjhendrix commented 10 years ago

@luiscape You reopened this. What's the latest?

luiscape commented 10 years ago

Oops. I don't remember re-opening. Sleepwalking, I assume.

Closing.

teodorescuserban commented 10 years ago

Still not solved, but I guess I will just get rid of some of the trash with @cjhendrix's assistance and approval. :)

cjhendrix commented 10 years ago

Serban, please paste the list here. I want to be doubly sure they aren't visible anywhere in CKAN. Then we can try to figure out why they are still in the db (deleted/private items, I'm guessing), and talk about deleting.

--cj

cjhendrix commented 10 years ago

@teodorescuserban Is this still an issue?

teodorescuserban commented 9 years ago

Sorry, it looks like this one escaped my eye somehow. Will check tomorrow and reply, @cjhendrix

cjhendrix commented 9 years ago

I did a bit more investigation on this. What I want to avoid is deleting anything from the database that may still be part of a dataset, even if the dataset is deleted or private or if it's part of an old revision.

Some of these garbage URLs are definitely used in deleted datasets:

However, in Serban's query result, this URL is listed as active (not deleted, as some of the others are).

Serban, I do think I need a better table like you described in order to better troubleshoot this. Could you query out:

  • Dataset: package id, package_name, private, state, revision id
  • Resource: resource name, resource revision id

These things have been in there for a while and don't seem to be causing problems, so there is no rush on this.

teodorescuserban commented 9 years ago

Please move it to the next sprint on Monday if I can't make it.

teodorescuserban commented 9 years ago

Still no time for that one...

teodorescuserban commented 9 years ago

query used:

select r.id as r_id, r.name as r_name, r.url as r_url, r.state as r_state, r.revision_id as r_rev_id,
       p.id as p_id, p.name as p_name, p.url as p_url, p.state as p_state, p.private as p_private, p.revision_id as p_rev_id
from resource as r, resource_group as g, package as p
where r.url like '%hdx-1.0.0%'
  and r.resource_group_id = g.id
  and g.package_id = p.id;

results will come by email in a few minutes.
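
To tell at a glance whether the matching resources sit in active, deleted, or private datasets, the same join can be grouped by state. Below is a rough, self-contained sketch that runs a grouped variant of the query above against an in-memory SQLite database. The table and column names are taken from the query in this comment; SQLite is only a stand-in for the production database, and every row in the demo data is invented for illustration.

```python
import sqlite3

# Minimal stand-in schema: only the columns the query in this thread touches.
SCHEMA = """
create table package (id text, name text, url text, state text, private integer, revision_id text);
create table resource_group (id text, package_id text);
create table resource (id text, name text, url text, state text, revision_id text, resource_group_id text);
"""

# Same join and filter as the query above, grouped by package/resource state.
QUERY = """
select p.state as p_state, p.private as p_private, r.state as r_state, count(*) as n
from resource as r
join resource_group as g on r.resource_group_id = g.id
join package as p on g.package_id = p.id
where r.url like '%hdx-1.0.0%'
group by p.state, p.private, r.state
order by p.state, p.private, r.state
"""


def summarize_garbage(conn):
    """Return (package state, private flag, resource state, count) rows."""
    return conn.execute(QUERY).fetchall()


def demo_connection():
    """Build an in-memory database with invented rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("insert into package values (?, ?, ?, ?, ?, ?)", [
        ("p1", "demo-active", "", "active", 0, "rev1"),
        ("p2", "demo-deleted", "", "deleted", 0, "rev2"),
    ])
    conn.executemany("insert into resource_group values (?, ?)", [
        ("g1", "p1"),
        ("g2", "p2"),
    ])
    conn.executemany("insert into resource values (?, ?, ?, ?, ?, ?)", [
        ("r1", "a.csv", "http://example.org/hdx-1.0.0/a.csv", "active", "rev1", "g1"),
        ("r2", "b.csv", "http://example.org/hdx-1.0.0/b.csv", "deleted", "rev2", "g2"),
        ("r3", "c.csv", "http://example.org/hdx/c.csv", "active", "rev1", "g1"),
        ("r4", "d.csv", "http://example.org/hdx-1.0.0/d.csv", "active", "rev2", "g2"),
    ])
    return conn
```

With the invented rows, summarize_garbage(demo_connection()) gives one row per combination of package state, private flag, and resource state. A count of active resources under a deleted package is the kind of case that would need a closer look before anything is removed.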

danmihaila commented 9 years ago

@teodorescuserban any update about this?

cjhendrix commented 9 years ago

I've got the email. Reassigning to me.

danmihaila commented 8 years ago

@cjhendrix is this still a valid issue?

cjhendrix commented 8 years ago

I suspect the issue still exists, but whether or not it is important is for @teodorescuserban to say.

danmihaila commented 8 years ago

@teodorescuserban please take a look and comment