flass / genomescans

R module and script for the analysis of aligned biological sequences, mainly to perform genome-wide scan for the detection of hotspots of diversity, LD, etc.
GNU General Public License v3.0
5 stars 1 forks source link

Error in strsplit, non-character argument #1

Closed jamiethompson77 closed 4 years ago

jamiethompson77 commented 4 years ago

I ran this code: ./genome-wide_localLD_scan.r -a core_alignment.fasta -o '/home/ubuntu/genomescans/output' --LD.metric=Fisher -f '/home/ubuntu/genomescans/pangenome_alignment.gff' -T 4 -K 4900

And retrieved this error: Error in strsplit(opt$excl.ref.label, split = ",") : non-character argument Execution halted

Could you advise what I'm doing wrong?

flass commented 4 years ago

Hi, it's because you were not specifying a reference genome (can be done using -r|--excl.ref.label option). It is used only as a reference for the site coordinates; it is not essential though, so I made it non mandatory in last commit c51b6f7. I see you're using this script on a pangenome alignment, so maybe you don't have a reference genome that would have most the sites present; by default the 1st row in the alignment will serve as a reference (but won't be excluded from analyses, unlike if you specified it via -r).

A few notes about using this for analysis of a pangenome (so I assume including accessory gene loci): I would be very cautious to interpret variations in LD levels across gene loci where the number of represented genomes vary. This script has been designed to deal with alignments of reference-mapped assemblies, and even though it is a bit flexible and allows for some localised indels here and there, the idea is that the sequence coverage should be homogeneous and close to full coverage; otherwise you have bias in the estimation of LD statistics, and you won't even know where does the signal come from. Aslo, I see you use -K 490: -K stands for crazy, because it needs many many plots already to cover the whole matrix for a just a viral genome; I can't think how many plots you'd need to fit the whole pangenome matrix. As a a rule of thumb you can plot max 200-bp square matrices per page; so divide the length of your pangenome alignment by 200, put it to the square, and you got you required number of pages. It comes without saying that R would take absolutely forever plotting all that. I thus strongly recommend you forget about graphical output (anyway it would be a total pain to navigate) and just rely on the table output. Also in general, this is a fairly computationally intensive script, which scales not to well with increasing number of sites. again it was designed for herpesvirus genomes that are of the 100,000 bp scale.

I hope this helps!