MTG / freesound-datasets

A platform for the collaborative creation of open audio collections labeled by humans and based on Freesound content.
https://annotator.freesound.org/
GNU Affero General Public License v3.0

compute_priority_score_candidate_annotations command makes production server not responding #156

Closed xavierfav closed 6 years ago

xavierfav commented 6 years ago

Over the last few days, we observed that running the management command compute_priority_score_candidate_annotations from the datasets Django app was the last straw that broke the camel's back and made the asplab-web1 server unreachable.

The command runs a Celery task that iterates through all the candidate annotations in the FSD dataset (~700k), calculates their priority score, and updates the score stored in the database. Details about the priority score calculation are in #133.

Here is the code of the celery task (datasets/tasks.py):

@shared_task
def compute_priority_score_candidate_annotations():
    logger.info('Start computing priority score of candidate annotations')
    dataset = Dataset.objects.get(short_name='fsd')
    candidate_annotations = dataset.candidate_annotations.filter(ground_truth=None)
    num_annotations = candidate_annotations.count()
    count = 0
    # Iterate all the sounds in chunks so we can do all transactions of a chunk atomically
    for chunk in chunks(list(candidate_annotations), 500):
        sys.stdout.write('\rUpdating priority score of candidate annotation %i of %i (%.2f%%)'
                         % (count + 1, num_annotations, 100.0 * (count + 1) / num_annotations))
        sys.stdout.flush()
        with transaction.atomic():
            for candidate_annotation in chunk:
                count += 1
                candidate_annotation.priority_score = candidate_annotation.return_priority_score()
                candidate_annotation.save()
    logger.info('Finished computing priority score of candidate annotations')
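
The task above relies on a `chunks` helper whose implementation is not shown in this issue. A minimal sketch of such a helper, assuming it only needs to yield lists of at most `size` items, could look like the following. Unlike `list(candidate_annotations)`, it accepts any iterable, so it does not require all ~700k objects to be loaded into memory first.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`.

    Hypothetical stand-in for the project's own `chunks` helper,
    which is not shown in the issue.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(iterable)
    while True:
        # islice consumes at most `size` items from the shared iterator
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

With a helper like this, the loop could in principle be fed from `candidate_annotations.iterator()` instead of a fully materialized list; whether that helps in practice depends on the database backend and was not measured here.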

Candidate annotation method for the calculation of the priority score (datasets/models.py):

def return_priority_score(self):
    sound = self.sound_dataset.sound
    sound_duration = sound.extra_data['duration']
    if not 0.3 <= sound_duration <= 30:
        return self.votes.count()
    else:
        duration_score = 3 if sound_duration <= 10 else 2 if sound_duration <= 20 else 1
        num_gt_same_sound = self.sound_dataset.ground_truth_annotations.filter(from_propagation=False).count()
        return 1000 * self.votes.exclude(test='FA').filter(vote__in=('1', '0.5')).count()\
             +  100 * duration_score\
             +        num_gt_same_sound
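
To make the formula easier to follow, here is the same arithmetic as a pure function with no database access. The function name and parameter names are hypothetical; `num_positive_votes` stands for the votes with value '1' or '0.5' that are not marked test='FA', and `num_votes` for all votes on the annotation.

```python
def priority_score(duration, num_votes, num_positive_votes, num_gt_same_sound):
    """Pure-Python restatement of return_priority_score (illustrative only)."""
    # Sounds outside the 0.3-30 s range are ranked by their raw vote count
    if not 0.3 <= duration <= 30:
        return num_votes
    # Shorter sounds get a higher duration score
    duration_score = 3 if duration <= 10 else 2 if duration <= 20 else 1
    return 1000 * num_positive_votes + 100 * duration_score + num_gt_same_sound


# A 5 s sound with 2 positive votes and 1 ground truth annotation on the same sound:
# 1000 * 2 + 100 * 3 + 1
print(priority_score(5.0, 3, 2, 1))  # 2301
```

The weights mean that positive votes dominate, duration breaks ties among annotations with the same number of positive votes, and the ground truth count only breaks the remaining ties.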

@alastair, do you think there is something wrong, or something we could improve?

xavierfav commented 6 years ago

As already discussed a bit, it is important to avoid too high a number of queries. Django stores the performed queries in the django.db.connection.queries list (when DEBUG is True).

Adding a select_related to fetch the sound instances (needed for their duration) already halves the number of queries.

candidate_annotations = dataset.candidate_annotations.filter(ground_truth=None)\
                               .select_related('sound_dataset__sound')

xavierfav commented 6 years ago

An aggregation was also added to compute the number of votes directly in the first query. This removes the queries that were previously issued for every single candidate annotation.

Now the task seems to be faster and less demanding on the database.
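
The effect of moving the vote count into the first query can be illustrated outside Django. The sketch below uses the standard library sqlite3 module rather than the Django ORM, with hypothetical `annotation` and `vote` tables, and counts executed statements with a trace callback. In Django the same idea would usually be expressed with `annotate()` and `Count`, but the exact query used in the commits is not reproduced here.

```python
import sqlite3

# Hypothetical, simplified schema: table and column names are assumptions
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE annotation (id INTEGER PRIMARY KEY);
    CREATE TABLE vote (id INTEGER PRIMARY KEY, annotation_id INTEGER, vote TEXT);
    INSERT INTO annotation (id) VALUES (1), (2), (3);
    INSERT INTO vote (annotation_id, vote) VALUES (1, '1'), (1, '0.5'), (2, '-1');
""")

executed = []  # every SQL statement run after this point is recorded here
conn.set_trace_callback(executed.append)


def counts_per_row():
    """One query for the ids, then one COUNT query per annotation."""
    ids = [row[0] for row in conn.execute("SELECT id FROM annotation ORDER BY id")]
    return {
        i: conn.execute(
            "SELECT COUNT(*) FROM vote "
            "WHERE annotation_id = ? AND vote IN ('1', '0.5')",
            (i,),
        ).fetchone()[0]
        for i in ids
    }


def counts_aggregated():
    """A single query that counts the positive votes of every annotation."""
    rows = conn.execute("""
        SELECT a.id, COUNT(v.id)
        FROM annotation AS a
        LEFT JOIN vote AS v
               ON v.annotation_id = a.id AND v.vote IN ('1', '0.5')
        GROUP BY a.id
        ORDER BY a.id
    """)
    return dict(rows.fetchall())


executed.clear()
per_row = counts_per_row()
queries_per_row = len(executed)

executed.clear()
aggregated = counts_aggregated()
queries_aggregated = len(executed)

print(per_row == aggregated, queries_per_row, queries_aggregated)
```

Both functions return the same counts, but the per-row variant issues a number of statements that grows with the number of annotations, which is the pattern that becomes expensive at ~700k rows.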

alastair commented 6 years ago

It would be nice to include issue numbers in commit messages (or open a pull request) so that we can track the improvements made to this problem.

xavierfav commented 6 years ago

Since the issue number was not provided in the merge commit message, I am listing the commits here: a15a83a63bfb193c610febf9a64b1a6e09d54397 a73e28d502cf399fb62d012cf328d2fc0c614e28 06c7330e14158142836c858b695260a70de71801 8d509d8e2196c4f11e24cde14ea1501d0f56e3ff