ChristopherLucas / transcribeR

R Package for Automated Speech Recognition

Job management #2

Closed ChristopherLucas closed 9 years ago

ChristopherLucas commented 9 years ago

At present, sendAudio() writes out a csv that contains the job IDs. You can add new jobs by passing the CSV back through sendAudio.

1) If given the same directory, sendAudio will reupload the same files but not with new job IDs. Let's add an argument to sendAudio() so that, by default, files with the same file name are not reuploaded. Something like reupload = FALSE.

Then, the CSV written by sendAudio() goes into retrieveText(), which creates a data frame with a new column. That data frame can go back into sendAudio() (in the event that some jobs aren't finished, it will fill them in without re-GETing all other jobs). However, now it has a new column, and when written, it might not be able to go back into sendAudio(). This is because this line in sendAudio() is too brittle: it compares the column names against exactly these three.

if(any(colnames(existing.job.csv) != c("NAMES","jobIDs","lang"))){

So...

2) We need a smarter way to identify a transcribeR csv, which will identify a csv and appropriately append to it at any stage in the workflow (first time through sendAudio, second time through, back to sendAudio after retrieveText, etc).
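One way to make the check in 2) tolerant of extra columns is to test that the required columns are present, not that the column names match exactly. This is only a sketch: `isJobTable` and `required.cols` are hypothetical names, not part of transcribeR, and the example values ("a.wav", "id-1", "en-US") are made up.

```r
# Hypothetical helper, not part of transcribeR: treat a data frame as a
# job table if it has at least the three columns that sendAudio()
# currently checks for. Extra columns are allowed.
required.cols <- c("NAMES", "jobIDs", "lang")

isJobTable <- function(df) {
  is.data.frame(df) && all(required.cols %in% colnames(df))
}

# A table as first written by sendAudio() (values are made up).
fresh <- data.frame(NAMES = "a.wav", jobIDs = "id-1", lang = "en-US",
                    stringsAsFactors = FALSE)

# The same table after a pass that adds a transcription column.
after.retrieve <- fresh
after.retrieve$transcribed.text <- NA_character_

isJobTable(fresh)              # passes
isJobTable(after.retrieve)     # passes, the extra column is ignored
isJobTable(data.frame(x = 1))  # fails, required columns are missing
```

Because `%in%` tests membership, the check also stops depending on column order.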

TmscanlanBoston commented 9 years ago

Do you mean you want to avoid having to get the transcription results for audio files that have already been transcribed, and you want to avoid sending files that already have job IDs?

TmscanlanBoston commented 9 years ago

It is possible to return Null (empty string, "") rather than the 'status'. Perhaps a time stamp alerting when the function should be called again?

ChristopherLucas commented 9 years ago

The current function just tries job IDs that were "queuing" at the time of the last call, so that actually works well. The fragile part is just the reliance on the CSV.

Probably the best way to do it is to just make retrieveText() smarter as it reads the csv (e.g., use more columns and have it subset to the ones it needs based on the existing columns or something).

TmscanlanBoston commented 9 years ago

Oh, by the way, I was wrong: the effect on performance will be barely noticeable if you have to read and write the CSV again.

ChristopherLucas commented 9 years ago

Interesting, thanks. Still, you were right about the design of the dataframe/table. Whatever the type, we should just create it with all eventual columns (including transcribed.text). That solves the second task in this issue and makes passing CSVs between sendAudio() and retrieveText() trivial. Just do whatever is simplest here.
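A minimal sketch of creating the table with all eventual columns up front, so the same column set survives every round trip through a CSV. `newJobTable` is a hypothetical name and the file names, job IDs, and language code are made up. The column names are the three from the existing check plus transcribed.text.

```r
# Hypothetical constructor: one row per audio file, with every column the
# workflow will eventually need. transcribed.text starts as NA and is
# meant to be filled in later.
newJobTable <- function(files, job.ids, lang) {
  data.frame(NAMES = files,
             jobIDs = job.ids,
             lang = lang,
             transcribed.text = NA_character_,
             stringsAsFactors = FALSE)
}

jobs <- newJobTable(c("a.wav", "b.wav"), c("id-1", "id-2"), "en-US")

# Round trip through a CSV: the column names are unchanged on re-read.
path <- tempfile(fileext = ".csv")
write.csv(jobs, path, row.names = FALSE)
reread <- read.csv(path, stringsAsFactors = FALSE)
colnames(reread)
```

With this layout, neither function needs to add or drop columns, so a strict column check would also keep working.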

As we discussed, the first task in this issue can be solved by storing full filenames and then grep'ing the new files to determine which should be skipped.
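A sketch of the skip logic, assuming the NAMES column stores full file paths. It uses exact matching with `%in%` instead of grep, since grep would treat characters such as "." in a file name as regular expression syntax unless `fixed = TRUE` is set. `filesToUpload` and the paths are hypothetical. The `reupload` argument mirrors the one proposed earlier in this issue.

```r
# Hypothetical helper: return the files that still need uploading.
# `existing` is the NAMES column of a previous job table (full paths).
filesToUpload <- function(new.files, existing, reupload = FALSE) {
  if (reupload) {
    return(new.files)  # caller asked to send everything again
  }
  new.files[!(new.files %in% existing)]
}

existing   <- c("/data/audio/a.wav", "/data/audio/b.wav")
candidates <- c("/data/audio/a.wav", "/data/audio/c.wav")

filesToUpload(candidates, existing)                   # only c.wav is left
filesToUpload(candidates, existing, reupload = TRUE)  # both candidates
```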