COG-UK / dipi-group

Data integrity and pipeline integration working group

Metadata file seems incomplete #172

Closed Jeltje closed 2 years ago

Jeltje commented 2 years ago

In a download we did yesterday, I found 160,916 sample IDs in the fasta file that are not in the metadata file. I have seen discrepancies like this for the past week, so it doesn't look like a one-time glitch.

This happened in October as well; I don't know if it's a similar issue.

rmcolq commented 2 years ago

The metadata file is matched to the aligned fasta. Are you comparing with the file of all sequences?

rmcolq commented 2 years ago

I count 1439105 sequences in the aligned fasta and 1439105 non-header rows in the metadata. There are 1601285 sequences in the full fasta. The difference is the sequences that fail our QC (but are made publicly available for completeness/openness). I think this explains your missing numbers.
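
For anyone wanting to reproduce this kind of check locally, here is a minimal sketch of comparing fasta IDs against metadata IDs. It is not the pipeline's own code. It assumes the metadata is a CSV whose ID column is called `sequence_name` and that the fasta ID is the header text after `>` up to the first whitespace. The column name, the function names and the file names in the usage note are all hypothetical, so adjust them to the files you actually downloaded.

```python
import csv
from typing import Iterable, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect IDs from FASTA header lines (text after '>' up to the first whitespace)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip a bare '>' with no ID
                ids.add(parts[0])
    return ids


def metadata_ids(lines: Iterable[str], id_column: str = "sequence_name") -> Set[str]:
    """Collect IDs from a CSV metadata file; id_column is an assumed column name."""
    reader = csv.DictReader(lines)
    return {row[id_column] for row in reader}


def missing_from_metadata(
    fasta_lines: Iterable[str],
    metadata_lines: Iterable[str],
    id_column: str = "sequence_name",
) -> Set[str]:
    """Return IDs that appear in the fasta but have no row in the metadata."""
    return fasta_ids(fasta_lines) - metadata_ids(metadata_lines, id_column)
```

To use it, open both files and pass the handles, e.g. `with open("all.fasta") as fa, open("metadata.csv", newline="") as md: print(len(missing_from_metadata(fa, md)))`. Running it against the aligned fasta should give an empty set if the metadata is matched to that file as described above; running it against the full fasta should list the sequences that have no metadata row.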

SamStudio8 commented 2 years ago

Judging by the :+1: here I think this has been solved!