HCGB-IGTP / XICRA

Small RNAseq pipeline for paired-end reads
MIT License

Duplicates from miraligner #45

Open jonahcullen opened 1 year ago

jonahcullen commented 1 year ago

Hello, I had a quick question/clarification re: duplicate miRNA IDs. I noticed that in the duplicates matrix a large number of miRNAs are flagged as duplicates because they share a license plate, but when you look at the annotation the difference seems to be due only to the ordering of the variants (e.g. iso_snv,iso_3p:-1 vs. iso_3p:-1,iso_snv). Should these actually be considered different miRNAs? In the couple of examples I looked at, it appears that half of the samples list the variants in one order and the other half in a different order, both with the same license plate. I wrote some code to adjust for that, which resulted in 0 duplicates from the inputs.

JFsanchezherrero commented 1 year ago

Hi there,

We decided to discard these isomiRs as they may be potential mistakes or real duplicates. We trust the miraligner and/or miRTop annotation and haven't investigated further.

In the real data we have analyzed, we always find that these isomiRs have spurious counts, so we decided to discard them, but it might be worth checking them further or taking them into account. The pipeline generates this duplicates matrix so that the end user can decide what to do about them.

Do you have any examples of how to solve this issue? It would be great to have a look.

Best regards

jonahcullen commented 1 year ago

Right okay. Perhaps I am just confused about how the ordering of the annotation, given the same UID, could be spurious? I started digging into this because I found 2x as many duplicates as non-duplicates. I don't have an exact number, but the vast majority of those duplicates have 0 counts, though some are kinda high.

I modified the isomir matrix generator function by including

isomirs = [
    'iso_3p', 'iso_add3p', 'iso_5p', 'iso_add5p', 'iso_snv',
    'iso_snv_central', 'iso_snv_central_offset', 'iso_snv_central_supp',
    'iso_snv_seed', 'NA'
]

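# Sort key: position of the variant type (the part before ':') in the
# isomirs list above. Raises ValueError for a type that is not in the list.
def sort_key(element):
    return isomirs.index(element.split(':')[0])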

and then, instead of

data['unique_id'] = data.apply(lambda data: data['miRNA'] + '&' + data['Variant'] + '&' + data['UID'], axis=1)

using

# Sort the comma-separated variants into a fixed order before building the ID
data['unique_id'] = data.apply(lambda row:
    row['miRNA']
    + '&'
    + ','.join(sorted(row['Variant'].split(','), key=sort_key))
    + '&'
    + row['UID'],
    axis=1
)
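
To make the effect easier to check, here is a minimal standalone sketch of the same idea without pandas. The miRNA name and UID in the records are made-up placeholders, and normalize_variant is just a helper name I picked for illustration; it is not part of XICRA, miraligner or miRTop. Unlike the snippet above, this version sorts unknown variant types last instead of raising an error, which is my own assumption about what would be preferable.

```python
# Canonical ordering of variant types (same list as above).
ISOMIR_ORDER = [
    'iso_3p', 'iso_add3p', 'iso_5p', 'iso_add5p', 'iso_snv',
    'iso_snv_central', 'iso_snv_central_offset', 'iso_snv_central_supp',
    'iso_snv_seed', 'NA',
]


def sort_key(element):
    """Rank a variant such as 'iso_3p:-1' by its type.

    Types missing from ISOMIR_ORDER sort last (an assumption, see lead-in);
    the full string is used as a tie-breaker so the order is deterministic.
    """
    variant_type = element.split(':')[0]
    if variant_type in ISOMIR_ORDER:
        return (ISOMIR_ORDER.index(variant_type), element)
    return (len(ISOMIR_ORDER), element)


def normalize_variant(variant):
    """Return the comma-separated variant string in canonical order."""
    return ','.join(sorted(variant.split(','), key=sort_key))


# Hypothetical (miRNA, Variant, UID) records: same UID, different ordering.
records = [
    ('hsa-miR-21-5p', 'iso_snv,iso_3p:-1', 'iso-22-EXAMPLE'),
    ('hsa-miR-21-5p', 'iso_3p:-1,iso_snv', 'iso-22-EXAMPLE'),
]

# IDs built from the raw annotation vs. from the normalized annotation.
raw_ids = {'&'.join(rec) for rec in records}
normalized_ids = {
    '&'.join((mirna, normalize_variant(variant), uid))
    for mirna, variant, uid in records
}

print(len(raw_ids), len(normalized_ids))  # 2 1
```

With the raw annotation the two records give two different IDs, and after normalization they collapse into one. In the pandas version the same helper could be applied with data['Variant'].map(normalize_variant), assuming the Variant column holds plain strings.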