ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Summary file version inconsistencies #240

Open victorlin opened 3 years ago

victorlin commented 3 years ago

Examples for ERR2756788.

Original summary header line [S3 link]:

SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,date=200607-01:47;

New summary header line [S3 link]:

readlength=150;SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,version=200818,date=200817-21:05;

New psummary contents [S3 link]:

sra=ERR2756788;SUMZER_COMMENT=sra=ERR2756788,genome=protref5,date=200831-02:23,type=protein;totalalns=77449;readlength=141;truncated=no;
sra=ERR2756788;famcvg=AAUWAWAAUAAAWWAAOAAWAAAAO;fam=Coronaviridae;score=100;pctid=71;alns=16477;avgcols=47;
sra=ERR2756788;famcvg=auwa_aoa_awwwu_aoowmmamo_;fam=Dicistroviridae;score=100;pctid=68;alns=663;avgcols=47;
...
sra=ERR2756788;gencvg=_.___.wwoomUooUUWmwwaaoa:;gen=Coronaviridae.S;score=100;pctid=66;alns=1607;avgcols=45;
sra=ERR2756788;gencvg=AWmUAWWAmAWAUUWWAUAWAAAAU;gen=Coronaviridae._prot1;score=100;pctid=73;alns=11375;avgcols=48;

2 questions:

  1. Can the new summary header line be arranged to start with SUMZER_COMMENT= as it was originally?
  2. For the new psummary, can the sra=ERR2756788; be removed from the beginning of every line?

I know these files have already been uploaded, so this is more a note for any future reprocessing.
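
Until the header order is made consistent, a reader of these files could parse the header by key rather than by position. The following is a minimal sketch, assuming only the `key=value;` layout visible in the examples above; the function names `parse_fields` and `parse_header` are hypothetical and are not part of serratus.

```python
def parse_fields(line):
    """Split a ';'-separated summary line into a dict of key -> value."""
    fields = {}
    for token in line.strip().split(";"):
        if not token:
            continue  # skip the empty token after the trailing ';'
        key, _, value = token.partition("=")  # split on the first '=' only
        fields[key] = value
    return fields


def parse_header(line):
    """Return (other_fields, comment_fields) for a summary header line.

    Works whether or not SUMZER_COMMENT= is the first field on the line.
    """
    fields = parse_fields(line)
    comment = fields.pop("SUMZER_COMMENT", "")
    # SUMZER_COMMENT holds comma-separated key=value pairs of its own.
    meta = dict(item.partition("=")[::2] for item in comment.split(",") if item)
    return fields, meta


old_header = "SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,date=200607-01:47;"
new_header = ("readlength=150;SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,"
              "version=200818,date=200817-21:05;")

for header in (old_header, new_header):
    extra, meta = parse_header(header)
    print(meta["sra"], meta["genome"], extra)
```

Looking fields up by key means both header versions yield the same `sra` and `genome` values, with `readlength` showing up as an extra field only for the new format.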

ababaian commented 3 years ago
  1. I think this is just a straight bug. The first line should start with SUMZER_COMMENT=, I totally agree.

  2. This is something @rcedgar and I have argued about. I originally disliked having sra=XXXX on every line quite a bit because it looks ugly, but in practice it's incredibly pragmatic, since we grep these files very often for spot checking and development. The same holds for anyone working with the summary files in bulk: having sra= on each line is very useful, and it solves some ugly problems with handling millions of files on a Linux file system. I'd opt to retain it.
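
To illustrate the point about bulk work, here is a rough Python sketch of filtering lines pooled from many psummary files. It assumes the `key=value;` line layout shown earlier in this thread; the function name `hits_for_family` and the second accession (`SRR0000001`) are made up for the example.

```python
def hits_for_family(lines, family):
    """Yield (sra, score) for every line whose fam= field matches `family`.

    Because each line carries its own sra= field, the lines can come from
    any number of concatenated files without tracking file names.
    """
    for line in lines:
        fields = dict(tok.partition("=")[::2]
                      for tok in line.strip().split(";") if tok)
        if fields.get("fam") == family:
            yield fields["sra"], int(fields["score"])


pooled = [
    # Line taken from the ERR2756788 psummary quoted above.
    "sra=ERR2756788;famcvg=AAUWAWAAUAAAWWAAOAAWAAAAO;fam=Coronaviridae;"
    "score=100;pctid=71;alns=16477;avgcols=47;",
    # Hypothetical line from a different, made-up accession.
    "sra=SRR0000001;famcvg=_________________________;fam=Dicistroviridae;"
    "score=10;pctid=50;alns=3;avgcols=40;",
]

for sra, score in hits_for_family(pooled, "Coronaviridae"):
    print(sra, score)
```

Without the per-line `sra=` field, the same filter would need to carry the source file name alongside every line, which is the bookkeeping the prefix avoids.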

victorlin commented 3 years ago

Good point about the grep. Would it be equally beneficial to have the sra=XXXX for nucleotide summary files as well? That way it's more consistent.

rcedgar commented 3 years ago

Yes, equally beneficial.