Minor issues with search and sparql

cldi / CanLink

Contains code and tools used to publish the Canadian thesis list.

http://canlink.library.ualberta.ca/

Minor issues with search and sparql #41

Open sfarnel opened 6 years ago

sfarnel commented 6 years ago

Seems some data may be missing on the server:

From search: Not Found

The requested URL /Person/6f3654711da669ee4963ac6c8447a618 was not found on this server.

Canned SPARQL query returns no data

rwarren2 commented 6 years ago

Old data. Waiting on @maharshmellow to signal that all the updates have been done before wiping database and restarting.

sfarnel commented 6 years ago

👍

maharshmellow commented 6 years ago

I just need to put a few slow processes into the multiprocessing loop so I'll let you know when I'm done that.

rwarren2 commented 6 years ago

Let's try and get a full-blown test in today?

rwarren2 commented 6 years ago

Which slow processes are these @maharshmellow ?

maharshmellow commented 6 years ago

The PDF URLs and the number of pages per PDF. It should be working now. I am now running all the files that I have to make sure that they all process correctly, but some of them are fairly large so it's taking a while. Does this file look good? d9fd0f621b8110cfd6c9b5b3c2563ff7.xml.zip

rwarren2 commented 6 years ago

We were going to put those processes outside of the web application and into cron jobs to minimize the impact on wait time.

Do these run independently of the submission process?

maharshmellow commented 6 years ago

Ohh right! I currently run them during the upload process. Should I make it so that it stores the rest of the data locally into some folder and then the cron job can run another python file that can process the pdf data and put the final output into /tmp?
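
A minimal sketch of the split proposed here, not the project's actual code: the upload handler only writes a small job file, and a separate script started by cron does the slow PDF work later. The function names, the JSON job layout, and the `count_pages` callable are all hypothetical stand-ins for illustration.

```python
import json
import tempfile
from pathlib import Path


def enqueue(spool_dir, record_id, pdf_url):
    """Called by the upload handler: save the job and return immediately."""
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    job = spool_dir / f"{record_id}.json"
    job.write_text(json.dumps({"id": record_id, "pdf_url": pdf_url}))
    return job


def process_pending(spool_dir, out_dir, count_pages):
    """Called from cron: process every queued job, then remove it.

    count_pages is a hypothetical callable (url -> int) standing in for
    whatever slow PDF inspection the real pipeline performs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for job in sorted(Path(spool_dir).glob("*.json")):
        record = json.loads(job.read_text())
        record["pages"] = count_pages(record["pdf_url"])
        (out_dir / job.name).write_text(json.dumps(record))
        job.unlink()  # the job is removed only after its output is written
        done.append(record["id"])
    return done
```

Because the job file is deleted only after the output is written, a crash in the middle of a run leaves the unfinished jobs in the spool folder for the next cron invocation.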

rwarren2 commented 6 years ago

Going to load your data directly to check a few other things, but it looks good so far.

Since this is all "nice to have" I would like to have it run as a separate cron job from whatever is in the sparql server.

For the page numbers, write a sparql query that finds theses without pages but with a PDF and iterate through them.
For the PDFs, since they are derived from the original manifestation PDF, also run it off the sparql store by looking for theses without a PDF link?
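
A rough sketch of the first query, assuming the store exposes a standard SPARQL endpoint that returns JSON results. The vocabulary is an assumption: `bibo:Thesis`, `bibo:numPages`, and `schema:url` are placeholders, and the predicates CanLink actually uses may differ. `run_select` and `bindings_to_pairs` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed vocabulary: these classes and predicates are placeholders and
# may not match what the CanLink store actually uses.
MISSING_PAGES_QUERY = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX schema: <http://schema.org/>
SELECT ?thesis ?pdf WHERE {
  ?thesis a bibo:Thesis ;
          schema:url ?pdf .
  FILTER NOT EXISTS { ?thesis bibo:numPages ?pages }
}
"""


def run_select(endpoint, query):
    """POST a SELECT query and return the parsed SPARQL JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode()
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def bindings_to_pairs(results):
    """Turn SPARQL JSON results into (thesis URI, PDF URL) pairs."""
    return [
        (row["thesis"]["value"], row["pdf"]["value"])
        for row in results["results"]["bindings"]
    ]
```

The cron job would then loop over `bindings_to_pairs(run_select(endpoint, MISSING_PAGES_QUERY))` and write a page count back for each thesis. The second case (theses without a PDF link) would use the same shape with the `FILTER NOT EXISTS` applied to the PDF predicate instead.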

maharshmellow commented 6 years ago

ok sounds good!