cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0
427 stars 107 forks source link

Reason for status=fail / Lineage=None #57

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

Was wondering what the main reason for the following fail situation might be?

It's only 8 SNPs and 0 indels from WUHAN-1 so was a bit surprised.

taxon,lineage,SH-alrt,UFbootstrap,lineages_version,status,note
XXXX,None,0,0,2020-04-27,fail,N_content:0.78
A       T       C       G       N       K       M       R       W       Y
8229    8852    5048    5401    2341    4       4       4       2       18

I notice N_content:0.78 is not a proportion or a percentage - out by factor of 10 ? It should be 7.8% N

2341/29903 = .07828645955255325552

https://github.com/hCoV-2019/pangolin/blob/ae8cf33001a7df33048e74c027562d2d9fcbee6b/pangolin/command.py#L99-L105

tseemann commented 4 years ago

Should it be len(record) instead of len(record.seq) ?

https://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html

aineniamh commented 4 years ago

To check your issue I have created a sequence with the exact same composition as your one above (it's in pangolin/test/test_seqs.fasta) and it passes the qc step (lineage_report.csv output shown below). Running the block of code that calculates proportion of N manually on it it prints out an N content of 0.08.

I'm not sure what's going on that you got N_content:0.78, I haven't changed any of the code. I wonder if it's something to do with gaps? But that should only lower N content, not inflate it.

taxon lineage SH-alrt UFbootstrap lineages_version status note
issue_57_torsten_seq B.1.12 100 91 release_2020-04-27 passed_qc
This_seq_has_6000_Ns_in_18000_bases B.1 100 100 release_2020-04-27 passed_qc
Japan/DP0779/2020 B 100 100 release_2020-04-27 passed_qc
USA/NY-NYUMC7/2020 B.1 100 100 release_2020-04-27 passed_qc
This_seq_has_no_seq None 0 0 release_2020-04-27 fail seq_len:0
This_seq_is_too_short None 0 0 release_2020-04-27 fail seq_len:2997
This_seq_has_lots_of_Ns None 0 0 release_2020-04-27 fail N_content:0.98
This_seq_is_literally_just_N None 0 0 release_2020-04-27 fail N_content:1.0

Should it be len(record) instead of len(record.seq)

That shouldn't make a difference, yes it would be more straighforward and I'll change it, but len(record) == len(record.seq).

num_N = str(record.seq).upper().count("N") 
prop_N = round((num_N)/len(record.seq), 2) 

and

num_N = str(record.seq).upper().count("N") 
prop_N = round((num_N)/len(record), 2) 

give the same answer.

aineniamh commented 4 years ago

Can you send me your sequence for debugging purposes and I'll look into it further.

aineniamh commented 4 years ago

Is it just that particular sequence reporting the weird value or is it across all query sequences in that particular file?

aineniamh commented 4 years ago

Hey @tseemann, I think I'll close this issue now as we can't seem to replicate it. If you ever come across it again, happy to reopen.

tseemann commented 4 years ago

@aineniamh sorry yes i haven't encountered it again but i did upgrade to git master regularly. it was weird had us flummoxed for a bit. closing is good.